GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
March 25, 2026
Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
cs.AI
Abstract
Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we distill 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of gameplay. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
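The triadic annotation scheme described in the abstract (time-synced captions of states, actions, and events, attributed to Self, Other Agents, or the World) can be sketched as a simple record type. This is a hypothetical illustration only; the class and field names below are assumptions, not the benchmark's released schema:

```python
from dataclasses import dataclass

@dataclass
class GameplayAnnotation:
    """One time-synced caption in a GameplayQA-style annotation stream.

    Hypothetical schema: field names are illustrative assumptions,
    not the actual released format.
    """
    timestamp_s: float  # position in the gameplay video, in seconds
    subject: str        # "self", "other_agent", or "world" (the triad)
    kind: str           # "state", "action", or "event"
    caption: str        # natural-language description

def label_density(annotations, duration_s):
    """Labels per second over a clip (the paper reports 1.22 labels/s)."""
    return len(annotations) / duration_s

# Three concurrent captions at the same timestamp, one per triad element
anns = [
    GameplayAnnotation(12.5, "self", "action", "fires at the opponent"),
    GameplayAnnotation(12.5, "other_agent", "state", "opponent is at low health"),
    GameplayAnnotation(12.5, "world", "event", "capture point becomes contested"),
]
print(round(label_density(anns, 10.0), 2))  # 0.3 labels/s for this toy clip
```

Concurrent captions sharing a timestamp are what make the stream "decision-dense": several entities can change state in the same instant, and a model must attribute each caption to the right subject.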