GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Abstract

Multimodal large language models (MLLMs) are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective. Existing benchmarks do not adequately evaluate these capabilities. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels per second, with time-synced, concurrent captions of states, actions, and events. The captions are structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we distill 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of gameplay. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
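To make the annotation structure described above concrete, the following is a minimal sketch of how a time-synced label over the Self / Other Agents / World triad might be represented, and how a labels-per-second density could be computed. All names here (`Entity`, `LabelKind`, `Label`, `label_density`, `concurrent_at`) and the example captions are hypothetical illustrations, not the released GameplayQA format.

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    """The triadic decomposition named in the abstract."""
    SELF = "self"
    OTHER_AGENT = "other_agent"
    WORLD = "world"


class LabelKind(Enum):
    """The three caption types named in the abstract."""
    STATE = "state"
    ACTION = "action"
    EVENT = "event"


@dataclass(frozen=True)
class Label:
    # Hypothetical fields; the real annotation schema may differ.
    video_id: str
    start_s: float
    end_s: float
    entity: Entity
    kind: LabelKind
    caption: str


def label_density(labels, duration_s):
    """Return the number of labels per second of video."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(labels) / duration_s


def concurrent_at(labels, t):
    """Return every label whose time span contains timestamp t (seconds)."""
    return [lab for lab in labels if lab.start_s <= t <= lab.end_s]


# Made-up example data, for illustration only.
example_labels = [
    Label("pov_a", 0.0, 1.5, Entity.SELF, LabelKind.ACTION, "reloads weapon"),
    Label("pov_a", 0.5, 2.0, Entity.OTHER_AGENT, LabelKind.ACTION, "teammate jumps"),
    Label("pov_a", 1.8, 2.0, Entity.WORLD, LabelKind.EVENT, "door opens"),
]
```

Under this sketch, a corpus-level density such as the reported 1.22 labels per second would correspond to the total label count divided by the total annotated video duration. How the authors actually aggregate across videos is not stated in the abstract.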

