Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
April 24, 2026
Authors: Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam
cs.AI
Abstract
Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
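The core idea — extracting chunk-level findings into a relational database with provenance, then aggregating via SQL instead of reasoning over concatenated text — can be sketched as follows. This is an illustrative assumption, not the paper's actual schema or code: the table name `facts`, its columns, and the sample records are all hypothetical, chosen only to show how a reconciliation-style check for inconsistent extractions might look in SQL.

```python
import sqlite3

# Hypothetical schema: each chunk-level extraction becomes a row with
# provenance (doc_id, chunk_id) and a rationale, so later reasoning is
# SQL over persistent state rather than over concatenated text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        entity    TEXT,     -- subject the record describes
        attribute TEXT,     -- extracted field name
        value     TEXT,     -- extracted value
        doc_id    TEXT,     -- provenance: source document
        chunk_id  INTEGER,  -- provenance: chunk within the document
        rationale TEXT      -- extraction rationale, kept for reconciliation
    )
""")
rows = [
    ("AcmeCo", "revenue_musd", "120", "10-K_2023",     4, "stated in item 7"),
    ("AcmeCo", "revenue_musd", "120", "press_release", 1, "headline figure"),
    ("AcmeCo", "revenue_musd", "115", "10-K_2022",     4, "prior-year figure"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)", rows)

# Reconciliation-style check: flag attributes that were extracted with
# conflicting values across documents, so they can be inspected and
# repaired (using provenance and rationales) before answering.
conflicts = conn.execute("""
    SELECT entity, attribute, COUNT(DISTINCT value) AS n_values
    FROM facts
    GROUP BY entity, attribute
    HAVING n_values > 1
""").fetchall()
print(conflicts)
```

Note that the aggregation cost here scales with the database, not with the token budget of any single model call, which is the property the abstract attributes to reasoning over persistent structured state.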