Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Abstract

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and across different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, even though all three fit within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
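
The extract, reconcile, and query flow described above can be illustrated with a minimal sketch. This is not the SLIDERS implementation: the table schema, the toy rows, and the helper names (`build_db`, `reconcile`, `total_revenue_by_company`) are all hypothetical, and the reconciliation here is a deliberately crude stand-in (drop incomplete rows, collapse exact duplicates) for the provenance- and rationale-driven stage the abstract describes. It only shows why a final answer computed in SQL over a persistent table does not require concatenating chunk outputs into one prompt.

```python
import sqlite3

# Hypothetical chunk-level extractions, hard-coded for illustration.
# In a real pipeline an LLM would emit these rows, one chunk at a time.
# Columns: company, year, revenue_musd, source_doc, chunk_id (provenance).
EXTRACTED = [
    ("Acme", 2022, 120.0, "acme_report.txt", 3),
    ("Acme", 2022, 120.0, "acme_press.txt", 1),     # same fact seen in another document
    ("Acme", 2023, 150.0, "acme_report.txt", 7),
    ("Globex", 2023, None, "globex_report.txt", 2),  # incomplete record
    ("Globex", 2023, 90.0, "globex_report.txt", 9),
]


def build_db(rows):
    """Load extracted rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue ("
        "company TEXT, year INTEGER, revenue_musd REAL, "
        "source_doc TEXT, chunk_id INTEGER)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def reconcile(conn):
    """Crude stand-in for reconciliation: drop incomplete rows, then
    keep one row per (company, year, revenue_musd) fact."""
    conn.execute("DELETE FROM revenue WHERE revenue_musd IS NULL")
    conn.execute(
        "DELETE FROM revenue WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM revenue "
        "GROUP BY company, year, revenue_musd)"
    )
    conn.commit()


def total_revenue_by_company(conn):
    """Answer an aggregate question with SQL instead of a long prompt."""
    return conn.execute(
        "SELECT company, SUM(revenue_musd) FROM revenue "
        "GROUP BY company ORDER BY company"
    ).fetchall()


if __name__ == "__main__":
    db = build_db(EXTRACTED)
    reconcile(db)
    print(total_revenue_by_company(db))
```

Without the reconciliation step, the duplicated Acme row would be counted twice in the sum, which is the kind of error that grows with the number of chunks.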
