
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

March 12, 2026
作者: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
cs.AI

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
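
The abstract invokes two quantitative notions, Classical Test Theory item discrimination and an accuracy-effort trade-off, without detail. As a rough, hypothetical illustration (the function names, data layout, and effort measure below are assumptions, not the paper's actual MADQA protocol), one could compute a corrected item-total discrimination index and per-agent accuracy-versus-effort summaries along these lines:

```python
import numpy as np

# Hypothetical sketch, not the MADQA evaluation harness itself. Assumptions:
# `correct` is an (n_agents, n_questions) 0/1 matrix of answer correctness,
# `effort` holds per-question effort (e.g., number of retrieval/tool calls).

def item_discrimination(correct: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation per question (a standard Classical
    Test Theory discrimination index): how well answering an item correctly
    predicts an agent's score on the remaining items."""
    rest_score = correct.sum(axis=1, keepdims=True) - correct  # score excluding the item
    disc = np.zeros(correct.shape[1])
    for j in range(correct.shape[1]):
        item, rest = correct[:, j], rest_score[:, j]
        if item.std() > 0 and rest.std() > 0:  # correlation undefined otherwise
            disc[j] = np.corrcoef(item, rest)[0, 1]
    return disc  # higher = item better separates strong from weak solvers

def accuracy_effort(correct: np.ndarray, effort: np.ndarray):
    """Per-agent accuracy and mean effort, the two axes of a trade-off curve."""
    return correct.mean(axis=1), effort.mean(axis=1)

# Toy usage with synthetic data: 5 agents, 20 questions.
rng = np.random.default_rng(0)
correct = (rng.random((5, 20)) < 0.6).astype(int)
effort = rng.integers(1, 30, size=(5, 20)).astype(float)
print(item_discrimination(correct).round(2))
print(accuracy_effort(correct, effort))
```

Under this reading, items with low or negative discrimination would be candidates for removal when maximizing discriminative power, while the per-agent accuracy-effort pairs would expose agents that buy accuracy through brute-force search rather than efficient planning.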