Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Abstract

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
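
The streaming training scheme summarized above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `RollingKVCache`, `stream_schedule`, `max_clips`, and `train_window` are hypothetical, integer clip indices stand in for real key/value tensors, and no model or reward is involved. The sketch only shows the bookkeeping, namely which earlier clips remain available as conditioning context and which clips fall inside the local window that would receive an RL update.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingKVCache:
    """Hypothetical stand-in for a KV cache that keeps only the newest clips."""
    max_clips: int
    entries: deque = field(init=False)

    def __post_init__(self):
        # A deque with maxlen silently evicts the oldest entry when full.
        self.entries = deque(maxlen=self.max_clips)

    def append(self, clip_id):
        self.entries.append(clip_id)

    def context(self):
        # Snapshot of the clips currently available as conditioning context.
        return list(self.entries)


def stream_schedule(num_clips, max_clips, train_window):
    """List (clip_id, context, trainable) for a progressively generated video.

    train_window is a half-open range (start, end) of clip indices; only
    clips inside it are marked for an update (an assumption of this sketch).
    """
    cache = RollingKVCache(max_clips)
    start, end = train_window
    schedule = []
    for clip_id in range(num_clips):
        context = cache.context()           # prior clips still in the cache
        trainable = start <= clip_id < end  # update only inside the window
        schedule.append((clip_id, context, trainable))
        cache.append(clip_id)               # the new clip joins the context
    return schedule


if __name__ == "__main__":
    for clip_id, context, trainable in stream_schedule(6, 2, (3, 5)):
        print(clip_id, context, "update" if trainable else "context only")
```

With six clips, a cache of two clips, and a window covering clips 3 and 4, every clip is still generated conditioned on its immediate predecessors, but only the two clips in the window are flagged for an update, so the cost of the update does not grow with the total video length.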