GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Abstract

Multimodal large language models (MLLMs) are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective. Existing benchmarks do not adequately evaluate these capabilities. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we curated 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
