MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
March 10, 2026
Authors: Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang
cs.AI
Abstract
As embodied models become increasingly powerful, humans will collaborate with multiple embodied AI agents in their workplaces or homes. To ensure smooth communication between human users and such a multi-agent system, it is crucial to interpret incoming information from the agents in parallel and to refer to the appropriate context for each query. Key challenges include effectively compressing and communicating high volumes of individual sensory input in the form of video, and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define the novel problem of simultaneously understanding multiple long-horizon egocentric videos collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation of diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches cannot effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across agents. The code and benchmark are available at https://ma-egoqa.github.io.
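To make the "shared memory across embodied agents and agent-wise dynamic retrieval" idea concrete, here is a minimal, hypothetical sketch. All names (`SharedMemory`, `MemoryEntry`, `retrieve_per_agent`) and the keyword-overlap scoring are illustrative assumptions, not the paper's actual EgoMAS implementation: each agent writes timestamped captions of its egocentric stream into a shared store, and a query retrieves the top-k entries per agent so no single agent's stream dominates the retrieved context.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the real EgoMAS pipeline likely uses learned
# embeddings over video features, not keyword overlap over text captions.

@dataclass
class MemoryEntry:
    agent_id: str     # which embodied agent produced this observation
    timestamp: float  # seconds into that agent's egocentric stream
    caption: str      # text summary of the video segment

@dataclass
class SharedMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, agent_id: str, timestamp: float, caption: str) -> None:
        self.entries.append(MemoryEntry(agent_id, timestamp, caption))

    def retrieve_per_agent(self, query: str, k: int = 2) -> dict[str, list[MemoryEntry]]:
        """Agent-wise retrieval: score every entry by word overlap with the
        query, then keep the top-k entries for each agent separately."""
        q = set(query.lower().split())
        scored: dict[str, list[tuple[int, MemoryEntry]]] = {}
        for e in self.entries:
            score = len(q & set(e.caption.lower().split()))
            scored.setdefault(e.agent_id, []).append((score, e))
        return {
            agent: [e for _, e in sorted(lst, key=lambda t: -t[0])[:k]]
            for agent, lst in scored.items()
        }

mem = SharedMemory()
mem.add("agent_a", 12.0, "picked up the red mug from the kitchen table")
mem.add("agent_b", 15.5, "watched agent_a carry a mug toward the sink")
mem.add("agent_b", 40.0, "swept the living room floor")

hits = mem.retrieve_per_agent("who handled the mug", k=1)
print({a: [e.caption for e in es] for a, es in hits.items()})
```

Keeping retrieval per-agent (rather than one global top-k) reflects the benchmark's framing: questions such as theory-of-mind or task coordination often require contrasting what different agents observed, which a single merged ranking can easily collapse.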