VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
January 23, 2026
Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
cs.AI
Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
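To make the abstract's description of configurable, gymnasium-style environments concrete, the following is a minimal, self-contained sketch of what an interaction loop with controls over difficulty, planning horizon, and textual feedback can look like. All names here (`EnvConfig`, `GridPuzzleEnv`, `obs_mode`, and so on) are illustrative assumptions and are not taken from the actual VisGym API; see the project page for the real interface.

```python
# Hypothetical sketch of a gymnasium-style loop for a VisGym-like environment.
# Class and parameter names are assumptions, not the actual VisGym API.
from dataclasses import dataclass
import random


@dataclass
class EnvConfig:
    difficulty: str = "easy"        # e.g. "easy" or "hard"
    obs_mode: str = "text"          # visual vs. textual rendering of the state
    max_steps: int = 20             # planning-horizon control
    textual_feedback: bool = True   # whether the env returns feedback strings


class GridPuzzleEnv:
    """Toy symbolic puzzle: move an agent to a goal cell on a 1-D grid."""

    def __init__(self, config: EnvConfig):
        self.config = config
        self.size = 5 if config.difficulty == "easy" else 9

    def reset(self, seed=None):
        rng = random.Random(seed)
        self.pos = rng.randrange(self.size)
        self.goal = rng.choice([i for i in range(self.size) if i != self.pos])
        self.steps = 0
        return self._observe(), {}

    def step(self, action: str):
        self.steps += 1
        self.pos += {"left": -1, "right": 1}.get(action, 0)
        self.pos = max(0, min(self.size - 1, self.pos))
        done = self.pos == self.goal
        truncated = self.steps >= self.config.max_steps
        feedback = ""
        if self.config.textual_feedback and not done:
            feedback = "move right" if self.goal > self.pos else "move left"
        return self._observe(), float(done), done, truncated, {"feedback": feedback}

    def _observe(self):
        # In VisGym the observation could be a rendered image; here we emit a
        # text rendering so the sketch stays self-contained and runnable.
        cells = ["A" if i == self.pos else "G" if i == self.goal else "."
                 for i in range(self.size)]
        return "".join(cells)


if __name__ == "__main__":
    env = GridPuzzleEnv(EnvConfig(difficulty="easy"))
    obs, _ = env.reset(seed=0)
    done = truncated = False
    while not (done or truncated):
        # A real agent would be a VLM conditioned on the (visual) observation
        # and its interaction history; this stand-in just reads the text state.
        action = "right" if obs.index("G") > obs.index("A") else "left"
        obs, reward, done, truncated, info = env.step(action)
    print("solved:", done)
```

The sketch mirrors the abstract's configuration axes (difficulty, observation representation, horizon, feedback) only at the interface level; the actual environments, solvers, and demonstration format are described in the code release at https://visgym.github.io/.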