MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Abstract

AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
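
To make the idea of checklist-based analysis concrete, the following is a minimal sketch of how per-question checklists could be scored alongside plain answer accuracy. Everything in it is an assumption for illustration: the `QuestionResult` structure, the three metric functions, and the toy data are hypothetical and are not taken from the benchmark's actual evaluation code or metric definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionResult:
    """Outcome for one benchmark question (hypothetical structure)."""
    final_answer_correct: bool
    # One flag per checklist item: True if the agent's trajectory satisfied it.
    checklist_hits: List[bool] = field(default_factory=list)


def accuracy(results: List[QuestionResult]) -> float:
    """Fraction of questions whose final answer is correct."""
    return sum(r.final_answer_correct for r in results) / len(results)


def strict_accuracy(results: List[QuestionResult]) -> float:
    """Fraction with a correct answer AND every checklist item satisfied.

    Illustrative definition: it separates answers reached through the
    intended reasoning path from answers that may have been guessed.
    """
    hits = sum(r.final_answer_correct and all(r.checklist_hits) for r in results)
    return hits / len(results)


def mean_checklist_score(results: List[QuestionResult]) -> float:
    """Average per-question fraction of satisfied checklist items."""
    scores = [sum(r.checklist_hits) / len(r.checklist_hits)
              for r in results if r.checklist_hits]
    return sum(scores) / len(scores)


# Toy data, invented purely for this example.
toy_results = [
    QuestionResult(True, [True, True, True]),     # correct, full path followed
    QuestionResult(True, [True, False]),          # correct, but a step was skipped
    QuestionResult(False, [True, False, False]),  # wrong, partial progress
    QuestionResult(False, [False, False]),        # wrong, no progress
]

print(f"accuracy             = {accuracy(toy_results):.2f}")
print(f"strict accuracy      = {strict_accuracy(toy_results):.2f}")
print(f"mean checklist score = {mean_checklist_score(toy_results):.3f}")
```

On the toy data, the gap between plain and strict accuracy is the kind of signal a checklist makes visible: one answer is correct without the full reasoning path being followed. As a side note on the reported figure, if 29.02% is a plain fraction over all 224 questions in a single run, it would correspond to 65 correct answers (65/224 ≈ 0.2902); that count is an inference, not a number stated in the abstract.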
