Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance and persist in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

