Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Abstract

Multimodal agents offer a promising path to automating complex, document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance and persist in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
