A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
September 19, 2025
Authors: Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, Jiangmiao Pang
cs.AI
Abstract
Robotic real-world reinforcement learning (RL) with vision-language-action
(VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient
exploration. We introduce VLAC, a general process reward model built upon
InternVL and trained on large-scale heterogeneous datasets. Given pairwise
observations and a language goal, it outputs a dense progress delta and a
done signal, eliminating task-specific reward engineering and supporting
one-shot in-context transfer to unseen tasks and environments. VLAC is
trained on vision-language datasets to strengthen perception, dialogue, and
reasoning capabilities, together with robot and human trajectory data that
ground action generation and progress estimation; it is further strengthened
to reject irrelevant prompts and to detect regression or stagnation by
constructing large numbers of negative and semantically mismatched samples.
Under prompt control, a single VLAC model alternately generates reward and
action tokens, unifying critic and policy. We deploy VLAC inside an
asynchronous real-world RL loop and layer a graded human-in-the-loop
protocol (offline demonstration replay, return-and-explore, human-guided
exploration) that accelerates exploration and stabilizes early learning.
Across four distinct real-world manipulation tasks, VLAC lifts success rates
from about 30% to about 90% within 200 real-world interaction episodes;
incorporating human-in-the-loop interventions yields a further 50%
improvement in sample efficiency and achieves up to 100% final success.
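To make the reward interface concrete, the sketch below shows how a process reward model of this kind could supply dense per-step rewards from pairwise observations and a language goal. This is a minimal illustration under stated assumptions, not the authors' implementation: the model call is stubbed out, and names such as `RewardSignal`, `stub_vlac`, and `dense_rewards` are hypothetical.

```python
# Hedged sketch: dense reward from (previous obs, current obs, goal) pairs,
# as described in the abstract. The VLM itself is replaced by a stub.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RewardSignal:
    progress_delta: float  # change in task progress for this step
    done: bool             # completion flag emitted by the model


def stub_vlac(prev_obs: str, curr_obs: str, goal: str) -> RewardSignal:
    """Stand-in for the reward-model call; returns toy values."""
    advanced = curr_obs != prev_obs
    return RewardSignal(progress_delta=0.1 if advanced else 0.0,
                        done=(curr_obs == goal))


def dense_rewards(trajectory: list[str], goal: str,
                  model: Callable[[str, str, str], RewardSignal]) -> list[float]:
    """Turn pairwise observations into per-step dense rewards,
    adding a completion bonus and stopping when the model says done."""
    rewards = []
    for prev_obs, curr_obs in zip(trajectory, trajectory[1:]):
        sig = model(prev_obs, curr_obs, goal)
        rewards.append(sig.progress_delta + (1.0 if sig.done else 0.0))
        if sig.done:
            break
    return rewards


traj = ["reach", "grasp", "lift", "place"]
print(dense_rewards(traj, goal="place", model=stub_vlac))  # → [0.1, 0.1, 1.1]
```

In an actual loop, each `model(...)` call would query the VLM with the two images and the language goal; the progress-delta formulation is what removes the need for task-specific reward engineering.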