OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet existing game benchmarks for VLM agents typically report only a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack a unified protocol for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with two contributions. The first is OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) modes with unified action interfaces. The second is the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill performs on held-out task variants. We report cold-start leaderboard results for twelve VLM agents and IDC results for four top agents.
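The IDC loop described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' harness: the names `run_idc`, `play`, `reflect`, `IDCResult`, and `max_skill_chars`, the character-based bound on the skill prompt, and the toy stand-ins are all hypothetical. The abstract specifies only that a reflector refines a bounded skill prompt over several rounds and that scores are tracked per round and on held-out variants.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signatures (not from the paper):
#   play(skill, variant)            -> (score, trajectory)
#   reflect(skill, trajectory, score) -> new skill prompt
PlayFn = Callable[[str, str], Tuple[float, str]]
ReflectFn = Callable[[str, str, float], str]


@dataclass
class IDCResult:
    round_scores: List[float] = field(default_factory=list)  # index 0 = cold start
    heldout_score: float = 0.0
    final_skill: str = ""


def run_idc(play: PlayFn, reflect: ReflectFn, num_rounds: int,
            train_variant: str, heldout_variant: str,
            max_skill_chars: int = 2000) -> IDCResult:
    """Run a cold-start episode, then num_rounds of reflect-and-replay."""
    skill = ""  # cold start: the agent plays with no learned skill prompt
    score, trajectory = play(skill, train_variant)
    result = IDCResult(round_scores=[score])
    for _ in range(num_rounds):
        # The reflector proposes a revised skill; truncation enforces the
        # (assumed) length bound on the skill prompt.
        skill = reflect(skill, trajectory, score)[:max_skill_chars]
        score, trajectory = play(skill, train_variant)
        result.round_scores.append(score)
    # Second observable: the final skill evaluated on a held-out variant.
    result.heldout_score, _ = play(skill, heldout_variant)
    result.final_skill = skill
    return result


# Toy stand-ins so the sketch runs without a game or an LLM.
def toy_play(skill: str, variant: str) -> Tuple[float, str]:
    penalty = 0.0 if variant == "train" else 1.0
    return float(len(skill)) - penalty, "trajectory"


def toy_reflect(skill: str, trajectory: str, score: float) -> str:
    return skill + "x"
```

In this sketch `round_scores` corresponds to the first observable (score across reflection rounds, with index 0 being the cold-start score) and `heldout_score` to the second (behavior of the learned skill on a held-out variant).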