

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

May 1, 2026
Authors: Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin
cs.AI

Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20–30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
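The abstract's central algorithmic claim is an adapted PPO with a lightweight turn-level critic, in contrast to critic-free estimators such as GRPO or Reinforce++ that assign a single normalized advantage to a whole rollout. The paper's exact formulation is not reproduced here, so the following is a minimal PyTorch sketch of one plausible reading: generalized advantage estimation computed at turn granularity (not per token), using per-turn value estimates from a small critic head, fed into the standard PPO clipped objective. All function names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Hypothetical sketch: GAE computed over environment turns.

    rewards: (T,) scalar reward per turn
    values:  (T+1,) critic value per turn; last entry is the bootstrap V(s_T)
    Returns per-turn advantages and value-function regression targets.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error between consecutive turns
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate; in a VLM agent, each turn's
    advantage is typically broadcast to the action tokens of that turn."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage over a 3-turn episode (values include the bootstrap term)
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.1, 0.2, 0.5, 0.0])
adv, ret = turn_level_gae(rewards, values)
loss = ppo_clip_loss(torch.randn(3), torch.randn(3), adv)
```

One design point this makes concrete: a learned critic supplies a per-turn baseline, so variance in the advantage estimate stays bounded as episodes stretch past 100 turns, whereas group-normalized, critic-free estimates become noisier as horizon grows, which is consistent with the stability and sample-efficiency gains the abstract reports.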