A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
September 19, 2025
Authors: Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, Jiangmiao Pang
cs.AI
Abstract
Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and it supports one-shot in-context transfer to unseen tasks and environments.
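To make this critic interface concrete, the sketch below shows how such a progress reward model could be queried with two observations and a language goal. The `query_critic` function, the prompt wording, and the `model.generate` call are illustrative assumptions, not the released VLAC API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RewardSignal:
    progress_delta: float  # dense change in task progress between the two frames
    done: bool             # completion signal for the language goal

def query_critic(model: Any, obs_prev: Any, obs_curr: Any, goal: str) -> RewardSignal:
    """Score progress toward `goal` between two observations.

    `model.generate` stands in for the multimodal generation call of an
    InternVL-style backbone; a real system would decode reward tokens.
    """
    prompt = (
        f"Goal: {goal}\n"
        "Compare the two images: report the progress delta and whether "
        "the task is complete."
    )
    reply = model.generate(images=[obs_prev, obs_curr], prompt=prompt)
    return RewardSignal(progress_delta=reply["delta"], done=reply["done"])
```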
VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation; it is further strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples.
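One plausible way to construct such negatives from ordinary trajectories is sketched below. The three pairings mirror the failure modes named above (regression, stagnation, irrelevant prompt), but the sampling recipe itself is an assumption rather than the paper's exact procedure.

```python
import random

def make_negatives(frames: list, goal: str, other_goals: list) -> list:
    """Build three kinds of hard negatives from one episode's frames."""
    i, j = sorted(random.sample(range(len(frames)), 2))
    return [
        # Regression: the pair is reversed, so true progress is negative.
        {"obs": (frames[j], frames[i]), "goal": goal, "label": "regress"},
        # Stagnation: an identical pair, so true progress is zero.
        {"obs": (frames[i], frames[i]), "goal": goal, "label": "stagnate"},
        # Semantic mismatch: a valid pair under an unrelated goal; the
        # critic should reject the irrelevant prompt.
        {"obs": (frames[i], frames[j]),
         "goal": random.choice(other_goals), "label": "reject"},
    ]
```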
With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy.
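A minimal sketch of this prompt-controlled alternation follows, reusing the hypothetical `model.generate` stand-in from the earlier sketch; the `[critic]`/`[policy]` mode tags are assumptions about the prompt format, and the point is only that the same weights serve both roles.

```python
def act_and_score(model, obs_prev, obs_curr, goal: str):
    """One step in which the same model acts as critic and as policy."""
    # Critic mode: emit reward tokens (progress delta, done) for the pair.
    reward = model.generate(images=[obs_prev, obs_curr],
                            prompt=f"[critic] Goal: {goal}")
    # Policy mode: emit action tokens for the current observation only.
    action = model.generate(images=[obs_curr],
                            prompt=f"[policy] Goal: {goal}")
    return reward, action
```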
Deployed inside an asynchronous real-world RL loop, VLAC is combined with a graded human-in-the-loop protocol (offline demonstration replay, return-and-explore, human-guided exploration) that accelerates exploration and stabilizes early learning.
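The sketch below shows the asynchronous decoupling in miniature: an actor thread streams transitions into a shared buffer while a learner consumes them, with the action source switched by protocol stage. All names, placeholder actions, and stage triggers are hypothetical stand-ins for the robot, stored demonstrations, and a teleoperation interface.

```python
import queue
import threading

replay: queue.Queue = queue.Queue()  # shared actor -> learner buffer

# Hypothetical action sources, one per stage of the graded protocol.
action_source = {
    "demo_replay":        lambda obs: "replayed demo action",
    "return_and_explore": lambda obs: "reset, then exploratory action",
    "human_guided":       lambda obs: "operator-provided action",
}

def actor(stage: str, episodes: int) -> None:
    """Collect transitions under one protocol stage; never blocks on updates."""
    pick = action_source[stage]
    for _ in range(episodes):
        obs = 0  # placeholder observation from the (absent) real robot
        replay.put({"obs": obs, "action": pick(obs), "reward": 0.0})

def learner(steps: int) -> None:
    """Consume transitions as they arrive and update the model (no-op here)."""
    for _ in range(steps):
        _ = replay.get()

threading.Thread(target=actor, args=("demo_replay", 10)).start()
threading.Thread(target=learner, args=(10,)).start()
```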
Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.