SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a retrieval corpus of 200,000 papers. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
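The corpus-level augmentation idea can be illustrated with a minimal sketch. This is not the paper's implementation: the three toy documents, the hand-written `HYPOTHETICAL_LLM_KEYWORDS` table (standing in for an LLM call), the `augment` helper, and the BM25 parameters `k1=1.5` and `b=0.75` are all assumptions made for illustration. It only shows why appending keywords to a document can help a lexical retriever match keyword-oriented sub-queries.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Small self-contained Okapi BM25 scorer (parameter values are assumed defaults)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        # Smoothed IDF that stays non-negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Document indices sorted by descending score (ties keep corpus order).
        return sorted(range(len(self.tokens)), key=lambda i: self.score(query, i), reverse=True)


docs = [
    "We study how proteins fold into stable three-dimensional shapes.",
    "A survey of convolutional networks for image classification.",
    "We propose a method that lets models attend to distant tokens cheaply.",
]

# Hypothetical stand-in for LLM-generated metadata and keywords per document.
HYPOTHETICAL_LLM_KEYWORDS = {
    0: "protein folding structural biology",
    1: "computer vision CNN deep learning",
    2: "sparse attention long-context transformer efficiency",
}


def augment(doc_id, text):
    # Append the (assumed) generated keywords to the original document text.
    return f"{text} {HYPOTHETICAL_LLM_KEYWORDS.get(doc_id, '')}".strip()


plain_index = BM25(docs)
augmented_index = BM25([augment(i, d) for i, d in enumerate(docs)])

query = "sparse attention transformer"  # a keyword-oriented sub-query
plain_score = plain_index.score(query, 2)
augmented_score = augmented_index.score(query, 2)
top_augmented = augmented_index.rank(query)[0]
```

In this toy setting the query shares no terms with the original text of the third document, so plain BM25 cannot distinguish it from the others, while the augmented index can. The augmentation runs once over the corpus rather than per query, which is the sense in which the sketch is "corpus-level".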
