MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.