
Reinforcing Action Policies by Prophesying

November 25, 2025
Authors: Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, Li Zhang
cs.AI

Abstract

Vision-Language-Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to the demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained on large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. It can few-shot adapt to new robots, objects, and environments, yielding a rollout-ready simulator. On top of Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success-rate gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
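To make the post-training recipe concrete, below is a minimal sketch, under stated assumptions, of how a GRPO-style update over learned-world-model rollouts with stepwise flow-head reweighting might look. Everything here is hypothetical: the abstract does not specify the FA-GRPO surrogate, the FlowScale weights, or any API, so `group_relative_advantages`, `flowscale_weights`, and `fa_grpo_objective` are illustrative placeholders following the generic GRPO recipe, not the authors' implementation.

```python
# Illustrative sketch of a ProphRL-style update (hypothetical names, not the paper's API).
# Assumptions: rewards come from rollouts in a learned world model (Prophet-like),
# advantages are group-relative as in generic GRPO, and FlowScale is approximated
# here by a uniform per-step weight over the flow head's denoising steps.
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward within its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def flowscale_weights(num_flow_steps):
    """Placeholder stepwise weights rescaling per-step flow-head contributions.
    The abstract only says FlowScale reweights per-step gradients; a uniform
    normalization over steps is used here purely for illustration."""
    return np.full(num_flow_steps, 1.0 / num_flow_steps)


def fa_grpo_objective(step_log_probs, advantages, step_weights):
    """Surrogate objective: advantage-weighted, step-reweighted log-likelihood of
    the sampled action chains (one row of per-step log-probs per rollout)."""
    step_log_probs = np.asarray(step_log_probs, dtype=np.float64)  # shape (G, T)
    weighted = step_log_probs @ step_weights                       # shape (G,)
    return float(np.mean(advantages * weighted))


# Toy usage: a group of G=4 rollouts, each with T=8 flow (denoising) steps.
rng = np.random.default_rng(0)
G, T = 4, 8
rewards = rng.uniform(0.0, 1.0, size=G)          # task rewards from world-model rollouts
log_probs = rng.normal(-1.0, 0.1, size=(G, T))   # per-step action log-probabilities
adv = group_relative_advantages(rewards)
obj = fa_grpo_objective(log_probs, adv, flowscale_weights(T))
print("surrogate objective:", obj)
```

In a real training loop this objective (or a clipped variant of it) would be maximized with respect to the VLA policy's parameters; the sketch only shows the advantage computation and the stepwise reweighting that the abstract attributes to FA-GRPO and FlowScale.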