InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but they either confine visual evidence to the input or treat it as an answer endpoint rather than as part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search steps. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Unlike existing benchmarks, it also includes multimodal multi-branch samples that require comparing multiple entities during evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench.
