GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective. Existing benchmarks do not adequately evaluate these capabilities. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second with time-synced, concurrent captions of states, actions, and events. The captions are structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
