Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

Abstract

Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions, a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning the value function only at the environment-step level, an arrangement that, to our knowledge, has not previously been explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image-understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
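The decoupling described in the abstract (a PPO update over action tokens, with value learned once per environment step) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `clip_eps` value, the one-step advantage (return minus predicted value), and the dictionary layout are all assumptions made for illustration, and the abstract does not specify how the value estimate is read out of the model.

```python
import math

def ppo_token_loss(new_logps, old_logps, advantage, clip_eps=0.2):
    """Clipped PPO surrogate averaged over the action tokens of ONE env step.

    Every token shares the same step-level advantage (assumption).
    clip_eps is an illustrative value, not taken from the source.
    """
    losses = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)  # per-token importance ratio
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)

def step_value_loss(value_pred, return_target):
    """Squared-error value loss, computed once per env step, not per token."""
    return 0.5 * (value_pred - return_target) ** 2

def decoupled_losses(step):
    """Return (policy_loss, value_loss) for one environment step.

    `step` is a hypothetical dict with keys: new_logps, old_logps,
    value (step-level prediction), and return (step-level target).
    """
    advantage = step["return"] - step["value"]  # simple one-step estimate (assumption)
    policy_loss = ppo_token_loss(step["new_logps"], step["old_logps"], advantage)
    value_loss = step_value_loss(step["value"], step["return"])
    return policy_loss, value_loss
```

In this sketch the two losses are returned separately rather than summed with a weighting coefficient, which is one way to read the abstract's remark about removing unstable weighting terms; how the actual method combines or schedules them is not stated in the abstract.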
