AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting, and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark that pairs 362 questions with eight domain collections totaling 9,664 authentic documents and 372M tokens. The collections far exceed any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest model reaches only 59.4% accuracy, and accuracy varies notably across domains.