VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
January 23, 2026
Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
cs.AI
Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
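The abstract describes a suite of interactive environments plus multi-step solvers whose rollouts serve as structured demonstrations for supervised finetuning. The actual VisGym API is not shown here, so the following is only a minimal, hypothetical sketch assuming a Gymnasium-style `reset`/`step` interface; the environment, the `scripted_solver` policy, and all names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a Gymnasium-style interaction loop for collecting
# a solver demonstration. All names here are assumptions for illustration;
# the real VisGym API may differ, and its observations would be images.

class ToyGridEnv:
    """Stand-in 1-D grid environment: move right to reach the goal cell."""

    def __init__(self, width=5):
        self.width = width
        self.pos = 0

    def reset(self):
        self.pos = 0
        # In VisGym the observation would be visual; here it is symbolic.
        return {"pos": self.pos, "goal": self.width - 1}

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, self.width - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.width - 1
        reward = 1.0 if done else 0.0
        # Textual feedback, analogous to the feedback mechanisms the
        # abstract says can be toggled per environment.
        feedback = "reached goal" if done else f"at cell {self.pos}"
        return {"pos": self.pos, "goal": self.width - 1}, reward, done, feedback

def scripted_solver(obs):
    """Placeholder for a multi-step solver (or, at test time, a VLM policy)."""
    return "right" if obs["pos"] < obs["goal"] else "stay"

env = ToyGridEnv(width=4)
obs = env.reset()
trajectory = []  # (observation, action) pairs, usable as an SFT demonstration
done = False
while not done:
    action = scripted_solver(obs)
    trajectory.append((obs, action))
    obs, reward, done, feedback = env.step(action)

print(len(trajectory), reward)  # → 3 1.0  (three moves from cell 0 to cell 3)
```

The collected `trajectory` list illustrates the kind of structured demonstration that a solver rollout could provide for finetuning; a real pipeline would serialize image observations and model responses rather than symbolic tuples.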