PDFTriage: Question Answering over Long, Structured Documents
September 16, 2023
Authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, Franck Dernoncourt
cs.AI
Abstract
Large Language Models (LLMs) struggle with document question answering
(QA) when the document cannot fit within the LLM's limited context
length. To overcome this issue, most existing works focus on
retrieving the relevant context from the document and representing it as plain
text. However, documents such as PDFs, web pages, and presentations are
naturally structured with different pages, tables, sections, and so on.
Representing such structured documents as plain text is incongruous with the
user's mental model of these documents with rich structure. When a system has
to query the document for context, this incongruity is brought to the fore, and
seemingly trivial questions can trip up the QA system. To bridge this
fundamental gap in handling structured documents, we propose an approach called
PDFTriage that enables models to retrieve the context based on either structure
or content. Our experiments demonstrate the effectiveness of the proposed
PDFTriage-augmented models across several classes of questions where existing
retrieval-augmented LLMs fail. To facilitate further research on this
fundamental problem, we release our benchmark dataset consisting of 900+
human-generated questions over 80 structured documents from 10 different
categories of question types for document QA.
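The abstract describes retrieving context by document structure (pages, sections, tables) rather than from a flat plain-text representation. Below is a minimal, hypothetical Python sketch of that idea; the `StructuredDocument` class and the `fetch_pages` / `fetch_section` / `fetch_table` helpers are assumed names for illustration only, not the paper's actual API, and the keyword heuristic stands in for the LLM-driven triage step that would normally choose which retrieval function to call.

```python
"""Illustrative sketch of structure-aware context retrieval in the spirit of
PDFTriage. All names here are hypothetical and only illustrate letting a model
request context by document structure instead of flat text chunks."""

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredDocument:
    """A document kept as structured parts rather than one flat text blob."""
    pages: List[str]                                         # page text, in order
    sections: Dict[str, str] = field(default_factory=dict)   # section title -> body
    tables: Dict[str, str] = field(default_factory=dict)     # table id -> serialized table


# --- Retrieval functions a tool-using LLM could call (hypothetical names) ---

def fetch_pages(doc: StructuredDocument, page_numbers: List[int]) -> str:
    """Return the text of the requested pages (1-indexed)."""
    return "\n\n".join(doc.pages[i - 1] for i in page_numbers if 0 < i <= len(doc.pages))


def fetch_section(doc: StructuredDocument, title: str) -> str:
    """Return the body of a named section, if present."""
    return doc.sections.get(title, "")


def fetch_table(doc: StructuredDocument, table_id: str) -> str:
    """Return a serialized table by its identifier."""
    return doc.tables.get(table_id, "")


def answer_question(doc: StructuredDocument, question: str) -> str:
    """Toy triage step: in the real system an LLM inspects the document's
    metadata (section titles, table ids, page count) and decides which
    retrieval function to call; here a keyword heuristic stands in for it."""
    if "table" in question.lower() and doc.tables:
        context = fetch_table(doc, next(iter(doc.tables)))
    elif "abstract" in question.lower() and "Abstract" in doc.sections:
        context = fetch_section(doc, "Abstract")
    else:
        context = fetch_pages(doc, [1])
    # The retrieved, structure-aware context would then be passed to the LLM
    # together with the question to generate the final answer.
    return f"[answer grounded in]: {context[:80]}..."


if __name__ == "__main__":
    doc = StructuredDocument(
        pages=["Page 1 text...", "Page 2 text..."],
        sections={"Abstract": "We propose PDFTriage..."},
        tables={"Table 1": "header_a | header_b\nvalue    | value"},
    )
    print(answer_question(doc, "What does Table 1 report?"))
```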