Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20 to 30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors: compared to classical deep RL trained from scratch, they significantly improve sample efficiency during RL training and reduce the need for manual design choices such as action engineering. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multimodal settings, and provide practical guidance for developing VLMs as embodied agents.
