OmniGAIA: Towards Native Omni-Modal AI Agents

Abstract

Human intelligence naturally intertwines omni-modal perception (spanning vision, audio, and language) with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language) and lack the unified cognitive capabilities required of general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks that require deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries from real-world data; answering them requires cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that follows a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
