AGORA: A Benchmark for Archive-Grounded Agentic Reasoning over Workplace Documents

Abstract

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting, and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections that together contain 9,664 authentic documents and 372M tokens. This is far more than any model's context window can hold, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest model reaches only 59.4% accuracy, with notable variation across domains.