FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Abstract

We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, leaving ample headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.
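
The abstract does not specify how retrieval results are fused. As one common possibility, the sketch below uses reciprocal rank fusion (RRF) to merge ranked lists from several retrievers. The function name, the document IDs, and the constant k = 60 are illustrative assumptions, not details taken from FreshStack.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    The value k = 60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties on the document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical and a dense retriever for one query.
lexical_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of several lists accumulate the highest scores, so the fused pool tends to favor candidates that independent retrievers agree on.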
