MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
March 10, 2026
Authors: Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang
cs.AI
Abstract
As embodied models become more powerful, humans will in the future collaborate with multiple embodied AI agents in the workplace or at home. To ensure smooth communication between human users and the multi-agent system, it is crucial to interpret incoming information from the agents in parallel and to refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating the high volume of individual sensory inputs arriving as video, and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we formally define a novel problem: simultaneously understanding multiple long-horizon egocentric videos collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in this scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA, named EgoMAS, which leverages shared memory across embodied agents together with agent-wise dynamic retrieval. Through a comprehensive evaluation of diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches cannot effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across agents. The code and benchmark are available at https://ma-egoqa.github.io.