PDFTriage: Question Answering over Long, Structured Documents
September 16, 2023
Authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, Franck Dernoncourt
cs.AI
Abstract
Large Language Models (LLMs) struggle with document question answering
(QA) when the document does not fit within the limited context length
of an LLM. To overcome this issue, most existing works focus on
retrieving the relevant context from the document and representing it as plain
text. However, documents such as PDFs, web pages, and presentations are
naturally structured with different pages, tables, sections, and so on.
Representing such structured documents as plain text is incongruous with the
user's mental model of these documents with rich structure. When a system has
to query the document for context, this incongruity is brought to the fore, and
seemingly trivial questions can trip up the QA system. To bridge this
fundamental gap in handling structured documents, we propose an approach called
PDFTriage that enables models to retrieve the context based on either structure
or content. Our experiments demonstrate the effectiveness of the proposed
PDFTriage-augmented models across several classes of questions where existing
retrieval-augmented LLMs fail. To facilitate further research on this
fundamental problem, we release our benchmark dataset consisting of 900+
human-generated questions over 80 structured documents, spanning 10 different
categories of question types for document QA.
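
The distinction between retrieving context by structure and by content can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Chunk` record, the `fetch_by_structure` and `fetch_by_content` helpers, the toy document, and the word-overlap scorer that stands in for a real retriever.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical unit of document text tagged with its structural location."""
    page: int       # 1-based page number the text appears on
    section: str    # title of the enclosing section
    text: str


def fetch_by_structure(chunks, page=None, section=None):
    """Structure-based lookup: select chunks by page number and/or section title."""
    return [
        c for c in chunks
        if (page is None or c.page == page)
        and (section is None or c.section == section)
    ]


def fetch_by_content(chunks, query, k=1):
    """Content-based lookup: rank chunks by word overlap with the query.

    The overlap count is a toy scorer (an assumption), used only so the
    example stays self-contained.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Toy document (invented for this example).
doc = [
    Chunk(1, "Introduction", "Quarterly revenue grew across all regions."),
    Chunk(2, "Methods", "We surveyed customers in three markets."),
    Chunk(2, "Results", "Customer satisfaction improved after the redesign."),
]

# "What is on page 2?" is a structural question: answer it from the layout.
page_two = fetch_by_structure(doc, page=2)

# "How did revenue change?" is a content question: answer it from the text.
best_match = fetch_by_content(doc, "how did revenue change", k=1)
```

In this sketch, a question such as "what is on page 2" shares no words with any chunk, so a purely content-based lookup has nothing to rank on, while the structural lookup resolves it directly. That is the kind of mismatch the abstract describes.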