Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

Abstract

Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions, a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level, an arrangement that, to our knowledge, has not previously been explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
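
The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names, the dictionary layout of a step, the squared-error value loss, and the `clip_eps` value are all assumptions made for illustration. The sketch shows only the general idea of a standard PPO clipped surrogate computed per action token, with a value target defined once per environment step.

```python
import math


def clipped_token_loss(new_logps, old_logps, advantage, clip_eps=0.2):
    """PPO clipped surrogate, averaged over the action tokens of one step.

    Assumption: the single step-level advantage is shared by every token
    generated during that environment step. clip_eps is a hypothetical value.
    """
    losses = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)  # pi_new(token) / pi_old(token)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)


def step_value_loss(value_pred, return_target):
    """Squared-error value loss, one prediction per environment step."""
    return 0.5 * (value_pred - return_target) ** 2


def decoupled_losses(steps, clip_eps=0.2):
    """Return (policy_loss, value_loss) as two separate numbers.

    Each element of `steps` is a dict (hypothetical layout) holding per-token
    log-probabilities and step-level scalars. The two losses are kept apart
    here; how they are applied to the model is not stated in the abstract.
    """
    policy = sum(
        clipped_token_loss(s["new_logps"], s["old_logps"], s["advantage"], clip_eps)
        for s in steps
    ) / len(steps)
    value = sum(step_value_loss(s["value"], s["return"]) for s in steps) / len(steps)
    return policy, value
```

In this sketch the policy term is computed at token granularity while the value term never looks at individual tokens, which is the separation the abstract refers to. Whether the published method shares one advantage across tokens or averages over them in this way is an assumption here.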