VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
January 23, 2026
Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
cs.AI
Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
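The abstract describes a suite of interactive environments plus multi-step solvers whose rollouts serve as structured demonstrations for supervised finetuning. The actual VisGym API is not shown here, so the following is only a minimal, hypothetical sketch assuming a Gymnasium-style `reset`/`step` interface; the environment, the `scripted_solver` policy, and all names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a Gymnasium-style interaction loop for collecting
# a solver demonstration. All names here are assumptions for illustration;
# the real VisGym API may differ, and its observations would be images.

class ToyGridEnv:
    """Stand-in 1-D grid environment: move right to reach the goal cell."""

    def __init__(self, width=5):
        self.width = width
        self.pos = 0

    def reset(self):
        self.pos = 0
        # In VisGym the observation would be visual; here it is symbolic.
        return {"pos": self.pos, "goal": self.width - 1}

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, self.width - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.width - 1
        reward = 1.0 if done else 0.0
        # Textual feedback, analogous to the feedback mechanisms the
        # abstract says can be toggled per environment.
        feedback = "reached goal" if done else f"at cell {self.pos}"
        return {"pos": self.pos, "goal": self.width - 1}, reward, done, feedback

def scripted_solver(obs):
    """Placeholder for a multi-step solver (or, at test time, a VLM policy)."""
    return "right" if obs["pos"] < obs["goal"] else "stay"

env = ToyGridEnv(width=4)
obs = env.reset()
trajectory = []  # (observation, action) pairs, usable as an SFT demonstration
done = False
while not done:
    action = scripted_solver(obs)
    trajectory.append((obs, action))
    obs, reward, done, feedback = env.step(action)

print(len(trajectory), reward)  # → 3 1.0  (three moves from cell 0 to cell 3)
```

The collected `trajectory` list illustrates the kind of structured demonstration that a solver rollout could provide for finetuning; a real pipeline would serialize image observations and model responses rather than symbolic tuples.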