AGORA: A Benchmark for Archive-Grounded Agentic Reasoning over Workplace Documents

Abstract

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence scattered across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting, and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark that pairs 362 questions with eight domain collections totaling 9,664 authentic documents and 372M tokens. The collections far exceed any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest model reaches only 59.4% accuracy, with notable variation across domains.