MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.
