OmniGAIA: Towards Native Omni-Modal AI Agents
February 26, 2026
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
cs.AI
Abstract
Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required of general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event-graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that demand cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that operates under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
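To make the tool-integrated reasoning paradigm concrete, below is a minimal sketch of the multi-turn agent loop the abstract describes: the model alternates between reasoning and tool calls, folding each observation back into its context until it emits a final answer. All names here (`agent_loop`, `model_step`, and the tools `video_qa`, `audio_transcribe`, `image_caption`) are hypothetical illustrations, not the OmniAtlas or OmniGAIA API.

```python
# Minimal sketch of a multi-turn, tool-integrated reasoning loop for an
# omni-modal agent. Tool names and the model interface are hypothetical
# illustrations, not the actual OmniAtlas implementation.
from typing import Callable, Dict

# Hypothetical tools: each takes a media path plus a query and returns text.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    "video_qa": lambda path, q: f"[video answer for {q!r} on {path}]",
    "audio_transcribe": lambda path, q: f"[transcript of {path}]",
    "image_caption": lambda path, q: f"[caption of {path}]",
}

def agent_loop(query: str, model_step: Callable[[str], dict], max_turns: int = 8) -> str:
    """Alternate between model reasoning and tool execution until the
    model emits a final answer or the turn budget is exhausted."""
    context = query
    for _ in range(max_turns):
        # The model decides: call a tool, or answer directly, e.g.
        # {"tool": "video_qa", "args": {...}} or {"answer": "..."}.
        step = model_step(context)
        if "answer" in step:
            return step["answer"]
        tool = TOOLS[step["tool"]]
        observation = tool(step["args"]["path"], step["args"]["query"])
        # Append the observation so the next reasoning turn can use it.
        context += f"\n[{step['tool']}] -> {observation}"
    return "No answer within turn budget."

# Toy usage: a stub policy that calls one tool, then answers.
def stub_policy(context: str) -> dict:
    if "[video_qa]" not in context:
        return {"tool": "video_qa", "args": {"path": "clip.mp4", "query": "who speaks first?"}}
    return {"answer": "The narrator speaks first."}

print(agent_loop("Who speaks first in clip.mp4?", stub_policy))
```

In practice the `model_step` callable would wrap the agent model itself; the loop structure (reason, call a tool, fold the observation back into the context) is the generic multi-turn pattern the abstract refers to.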