Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

March 17, 2026
Authors: Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao
cs.AI

Abstract

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
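
The streaming training scheme described in the abstract can be pictured with a small scheduling sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `streaming_windows`, the parameters `num_clips` and `cache_size`, and the use of integer clip indices in place of cached key/value tensors are all hypothetical. It shows one plausible reading of "rolling KV-cache plus local clip window": each new clip is conditioned on a bounded history, and only that clip would receive the RL update.

```python
from collections import deque
from typing import Iterator, List, Tuple


def streaming_windows(num_clips: int, cache_size: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (context_clip_ids, trainable_clip_id) pairs for a rolling cache.

    Hypothetical illustration: integer clip indices stand in for cached
    key/value states; a real model would hold tensors here instead.
    """
    # A bounded deque evicts the oldest entry automatically once it is full,
    # which mimics a rolling cache of fixed capacity.
    cache: deque = deque(maxlen=cache_size)
    for clip in range(num_clips):
        # The context is treated as frozen history; only `clip` is the
        # local window that an RL update would touch in this sketch.
        yield list(cache), clip
        cache.append(clip)


schedule = list(streaming_windows(num_clips=5, cache_size=2))
for context, clip in schedule:
    print(f"update clip {clip} conditioned on cached clips {context}")
```

In this sketch the memory held for context is bounded by `cache_size` regardless of video length, which is the property that makes restricting updates to a local window attractive for long sequences.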