Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
March 12, 2026
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
cs.AI
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.