PDFTriage: Question Answering over Long, Structured Documents
September 16, 2023
Authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, Franck Dernoncourt
cs.AI
Abstract
Large Language Models (LLMs) struggle with document question answering
(QA) when the document cannot fit within the LLM's limited context
length. To overcome this issue, most existing works focus on
retrieving the relevant context from the document and representing it as plain
text. However, documents such as PDFs, web pages, and presentations are
naturally structured with different pages, tables, sections, and so on.
Representing such structured documents as plain text is incongruous with the
user's mental model of these documents with rich structure. When a system has
to query the document for context, this incongruity is brought to the fore, and
seemingly trivial questions can trip up the QA system. To bridge this
fundamental gap in handling structured documents, we propose an approach called
PDFTriage that enables models to retrieve the context based on either structure
or content. Our experiments demonstrate the effectiveness of the proposed
PDFTriage-augmented models across several classes of questions where existing
retrieval-augmented LLMs fail. To facilitate further research on this
fundamental problem, we release our benchmark dataset consisting of 900+
human-generated questions over 80 structured documents from 10 different
categories of question types for document QA.
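The abstract describes retrieving context by document structure (pages, sections, tables) rather than from a flat plain-text representation. Below is a minimal, hypothetical Python sketch of that idea; the `StructuredDocument` class and the `fetch_pages` / `fetch_section` / `fetch_table` helpers are assumed names for illustration only, not the paper's actual API, and the keyword heuristic stands in for the LLM-driven triage step that would normally choose which retrieval function to call.

```python
"""Illustrative sketch of structure-aware context retrieval in the spirit of
PDFTriage. All names here are hypothetical and only illustrate letting a model
request context by document structure instead of flat text chunks."""

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredDocument:
    """A document kept as structured parts rather than one flat text blob."""
    pages: List[str]                                         # page text, in order
    sections: Dict[str, str] = field(default_factory=dict)   # section title -> body
    tables: Dict[str, str] = field(default_factory=dict)     # table id -> serialized table


# --- Retrieval functions a tool-using LLM could call (hypothetical names) ---

def fetch_pages(doc: StructuredDocument, page_numbers: List[int]) -> str:
    """Return the text of the requested pages (1-indexed)."""
    return "\n\n".join(doc.pages[i - 1] for i in page_numbers if 0 < i <= len(doc.pages))


def fetch_section(doc: StructuredDocument, title: str) -> str:
    """Return the body of a named section, if present."""
    return doc.sections.get(title, "")


def fetch_table(doc: StructuredDocument, table_id: str) -> str:
    """Return a serialized table by its identifier."""
    return doc.tables.get(table_id, "")


def answer_question(doc: StructuredDocument, question: str) -> str:
    """Toy triage step: in the real system an LLM inspects the document's
    metadata (section titles, table ids, page count) and decides which
    retrieval function to call; here a keyword heuristic stands in for it."""
    if "table" in question.lower() and doc.tables:
        context = fetch_table(doc, next(iter(doc.tables)))
    elif "abstract" in question.lower() and "Abstract" in doc.sections:
        context = fetch_section(doc, "Abstract")
    else:
        context = fetch_pages(doc, [1])
    # The retrieved, structure-aware context would then be passed to the LLM
    # together with the question to generate the final answer.
    return f"[answer grounded in]: {context[:80]}..."


if __name__ == "__main__":
    doc = StructuredDocument(
        pages=["Page 1 text...", "Page 2 text..."],
        sections={"Abstract": "We propose PDFTriage..."},
        tables={"Table 1": "header_a | header_b\nvalue    | value"},
    )
    print(answer_question(doc, "What does Table 1 report?"))
```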