Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
August 6, 2025
Authors: George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov
cs.AI
Abstract
Interactive multimodal agents must convert raw visual observations into
coherent sequences of language-conditioned actions -- a capability that current
vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL)
efforts could, in principle, endow VLMs with such skills, but they have seldom
tested whether the learned behaviours generalize beyond their training
simulators, and they depend either on brittle hyperparameter tuning or on
dense-reward environments with low state variability. We introduce
Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight,
hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens
while learning value only at the environment-step level: an arrangement, to our
knowledge, not previously explored for large VLMs or LLMs. This simple
decoupling removes unstable weighting terms and yields faster, more reliable
convergence. Training a single VLM with VL-DAC in one inexpensive simulator at
a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies
that generalize widely: +50% relative on BALROG (game-centric agentic
control), +5% relative on the hardest part of VSI-Bench (spatial planning),
and +2% on VisualWebBench (web navigation), all without degrading general
image understanding accuracy. These results provide the first evidence that a
image understanding accuracy. These results provide the first evidence that a
simple RL algorithm can train VLMs entirely in cheap synthetic worlds while
delivering measurable gains on real-image agentic, spatial-reasoning, and
web-navigation benchmarks.
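
The abstract's central algorithmic idea, PPO-style clipped updates applied per action token while the critic is fit only once per environment step, can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the tensor names, shapes, and the way the per-step advantage is obtained are all assumptions made for the example.

```python
# Minimal sketch (assumed names and shapes, not the authors' code) of the
# decoupling described in the abstract: a per-token PPO policy loss over
# action tokens, with the value head trained at the environment-step level.
import torch
import torch.nn.functional as F

def vl_dac_losses(new_logp_tok, old_logp_tok, token_mask,
                  step_value, step_return, step_advantage,
                  clip_eps=0.2):
    """new_logp_tok, old_logp_tok, token_mask: (B, T) per-token quantities.
    step_value, step_return, step_advantage: (B,) per-step quantities."""
    # Broadcast the single per-step advantage to every action token of that step.
    adv = step_advantage.unsqueeze(1)                      # (B, 1)
    ratio = torch.exp(new_logp_tok - old_logp_tok)         # (B, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped) * token_mask
    policy_loss = per_token.sum() / token_mask.sum().clamp(min=1)

    # Critic is regressed only at the environment-step level: one value and
    # one return target per step, with no per-token credit weighting.
    value_loss = F.mse_loss(step_value, step_return)
    return policy_loss, value_loss
```

Because the advantage is estimated once per environment step and simply broadcast to the action tokens, no per-token weighting or credit-assignment term appears in the loss, which is consistent with the abstract's claim that removing such weighting terms yields faster and more reliable convergence.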