A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

Abstract

Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, which eliminates task-specific reward engineering and supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen its perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation. It is further strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. We deploy VLAC inside an asynchronous real-world RL loop and layer on a graded human-in-the-loop protocol (offline demonstration replay, return and explore, human-guided exploration) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
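To make the pairwise reward interface concrete, the following is a minimal sketch of how a critic that maps two observations and a language goal to a progress delta and a done flag could supply dense per-step rewards. Everything in it is an assumption made for illustration: the names CriticOutput, toy_critic, and dense_rewards are hypothetical, observations are reduced to a scalar "fraction complete", and the toy critic merely stands in for a learned model. It is not the VLAC interface or training code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class CriticOutput:
    progress_delta: float  # signed change in task progress between two observations
    done: bool             # True if the goal looks complete in the later observation


# Hypothetical critic signature: (earlier_obs, later_obs, language_goal) -> CriticOutput
Critic = Callable[[float, float, str], CriticOutput]


def toy_critic(obs_prev: float, obs_next: float, goal: str) -> CriticOutput:
    """Stand-in for a learned model; observations are scalar 'fraction complete' values."""
    return CriticOutput(progress_delta=obs_next - obs_prev, done=obs_next >= 1.0)


def dense_rewards(
    observations: Sequence[float], goal: str, critic: Critic
) -> Tuple[List[float], bool]:
    """Score each consecutive observation pair and stop at the first done signal."""
    rewards: List[float] = []
    for prev, nxt in zip(observations, observations[1:]):
        out = critic(prev, nxt, goal)
        rewards.append(out.progress_delta)  # one reward per transition, not per episode
        if out.done:
            return rewards, True
    return rewards, False


if __name__ == "__main__":
    # Progress, stagnation (zero delta), regression (negative delta), then completion.
    trajectory = [0.0, 0.25, 0.25, 0.125, 1.0, 1.0]
    rewards, done = dense_rewards(trajectory, "put the bowl on the plate", toy_critic)
    print(rewards, done)
```

In this toy setting, stagnation shows up as a zero delta and regression as a negative one, which is the kind of per-step signal a sparse success-only reward cannot provide.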

