MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three respects: (1) it uses natural language queries without explicit modality cues, (2) it incorporates underexplored modalities such as video and audio, and (3) it requires retrieving complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, and the best-performing agent reaches only 40.1%. We further observe that while stronger agents such as Gemini Deep Research achieve higher performance, their gains are modest due to over-exploration: these agents take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, which leads to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.
