Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
April 24, 2026
Authors: Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam
cs.AI
Abstract
Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
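The core idea of reasoning over persistent structured state via SQL can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual system: the table schema (`facts` with provenance columns), the sample records, and the dedup-by-earliest-chunk rule are all invented here for concreteness; SLIDERS' real extraction schema and reconciliation procedure are described in the paper itself.

```python
import sqlite3

# Hypothetical schema: chunk-level extractions stored with provenance
# (doc_id, chunk_id) and an extraction rationale, as the abstract describes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        entity TEXT, attribute TEXT, value TEXT,
        doc_id TEXT, chunk_id INTEGER, rationale TEXT
    )
""")

# Invented sample records: the same fact extracted from two different chunks.
rows = [
    ("AcmeCorp", "revenue_2023", "1.2B", "10k.pdf", 3, "stated in summary"),
    ("AcmeCorp", "revenue_2023", "1.2B", "10k.pdf", 41, "repeated in appendix"),
    ("AcmeCorp", "ceo", "J. Smith", "press.pdf", 1, "named in headline"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)", rows)

# Toy reconciliation step: collapse duplicate (entity, attribute, value)
# records, keeping the row from the earliest extraction.
conn.execute("""
    DELETE FROM facts WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM facts GROUP BY entity, attribute, value
    )
""")

# Answering a question then becomes a SQL query over structured state,
# rather than re-reading concatenated chunk text.
result = conn.execute(
    "SELECT value, doc_id FROM facts WHERE entity = ? AND attribute = ?",
    ("AcmeCorp", "revenue_2023"),
).fetchall()
print(result)  # -> [('1.2B', '10k.pdf')]
```

The key property this sketch exercises is that the answer-time cost is a database query, independent of how many chunks were processed during extraction; real reconciliation of inconsistent or incomplete records would of course need more than a GROUP BY.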