Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
August 6, 2025
Authors: George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov
cs.AI
Abstract
Interactive multimodal agents must convert raw visual observations into
coherent sequences of language-conditioned actions -- a capability that current
vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL)
efforts could, in principle, endow VLMs with such skills, but they have seldom
tested whether the learned behaviours generalize beyond their training
simulators, and they depend either on brittle hyperparameter tuning or on
dense-reward environments with low state variability. We introduce
Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight,
hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens
while learning value only at the environment-step level: an arrangement, to our
knowledge, not previously explored for large VLMs or LLMs. This simple
decoupling removes unstable weighting terms and yields faster, more reliable
convergence. Training a single VLM with VL-DAC in one inexpensive simulator at
a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies
that generalize widely: +50% relative on BALROG (game-centric agentic
control), +5% relative on the hardest part of VSI-Bench (spatial planning),
and +2% on VisualWebBench (web navigation), all without degrading general
image understanding accuracy. These results provide the first evidence that a
simple RL algorithm can train VLMs entirely in cheap synthetic worlds while
delivering measurable gains on real-image agentic, spatial-reasoning, and
web-navigation benchmarks.
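
To make the decoupling concrete, below is a minimal sketch (not the authors' implementation) of how a clipped PPO surrogate over action tokens can be paired with a value loss defined only at the environment-step level. The tensor shapes, the broadcast of one step-level advantage to that step's action tokens, and the hyperparameter values are assumptions for illustration.

```python
# Sketch of a token-level policy update combined with a step-level value loss,
# in the spirit of the decoupling described in the abstract. Shapes, the
# advantage estimator, and clip_eps are illustrative assumptions.

import torch


def decoupled_actor_critic_losses(
    new_logprobs,   # (num_steps, max_action_tokens) token log-probs under current policy
    old_logprobs,   # (num_steps, max_action_tokens) token log-probs at rollout time
    token_mask,     # (num_steps, max_action_tokens) 1 for real action tokens, 0 for padding
    step_values,    # (num_steps,) critic output, one value per environment step
    step_returns,   # (num_steps,) return target, one per environment step
    clip_eps=0.2,
):
    # Policy: clipped PPO surrogate applied to every action token, with the
    # step-level advantage broadcast to all tokens emitted at that step.
    advantages = (step_returns - step_values).detach()
    advantages = advantages.unsqueeze(1).expand_as(new_logprobs)

    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -(torch.min(unclipped, clipped) * token_mask).sum() / token_mask.sum()

    # Value: regressed only at the environment-step level (one target per step),
    # never per token -- this is the decoupling referred to above.
    value_loss = torch.nn.functional.mse_loss(step_values, step_returns)

    return policy_loss, value_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    steps, tokens = 4, 6
    new_lp = torch.randn(steps, tokens, requires_grad=True)
    old_lp = new_lp.detach() + 0.01 * torch.randn(steps, tokens)
    mask = torch.ones(steps, tokens)
    values = torch.randn(steps, requires_grad=True)
    returns = torch.randn(steps)
    pi_loss, v_loss = decoupled_actor_critic_losses(new_lp, old_lp, mask, values, returns)
    print(pi_loss.item(), v_loss.item())
```

In this sketch, removing any per-token value or weighting term leaves only the clip range as a tunable quantity, which is consistent with the abstract's claim that the decoupling eliminates unstable weighting terms.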