OmniGAIA: Towards Native Omni-Modal AI Agents

Abstract

Human intelligence naturally intertwines omni-modal perception (spanning vision, audio, and language) with complex reasoning and tool use to interact with the world. However, current multi-modal LLMs are largely confined to bi-modal interactions (e.g., vision-language) and lack the unified cognitive capabilities required of general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks that require deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that follows the tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy, and with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
