PDFTriage: Question Answering over Long, Structured Documents

Abstract

Large Language Models (LLMs) struggle with document question answering (QA) when the document does not fit within the LLM's limited context length. To overcome this issue, most existing work focuses on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these richly structured documents. When a system has to query the document for context, this incongruity comes to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve context based on either structure or content. Our experiments demonstrate the effectiveness of PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents, covering 10 different categories of question types for document QA.
