PDFTriage: Question Answering over Long, Structured Documents

Abstract

Large Language Models (LLMs) struggle with document question answering (QA) when the document does not fit in the limited context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these richly structured documents. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA.
