Test-Time Strategies for More Efficient and Accurate Agentic RAG
March 12, 2026
作者: Brian Zhang, Deepti Guntur, Zhiyang Zuo, Abhinav Sharma, Shreyas Chaudhari, Wenlong Zhao, Franck Dernoncourt, Puneet Mathur, Ryan Rossi, Nedim Lipka
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) systems face challenges with complex, multi-hop questions, and agentic frameworks that operate iteratively, such as Search-R1 (Jin et al., 2025), have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and difficulty contextualizing retrieved results effectively within the current generation prompt. These issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption.
In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these shortcomings. Specifically, we explore the integration of two components, individually and in combination: a contextualization module, which better integrates relevant information from retrieved documents into the reasoning process, and a de-duplication module, which replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches on the HotpotQA (Yang et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns.
Our best-performing variant, utilizing GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
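The two test-time modules described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: every function and name here (`deduplicate`, `contextualize`, `agentic_rag_turn`, the document schema) is an assumption, and the keyword filter stands in for the LLM call (e.g., GPT-4.1-mini) that the paper uses for contextualization.

```python
def deduplicate(ranked_docs, seen_ids, k=3):
    """De-duplication module (sketch): skip documents retrieved in
    earlier turns and return the next most relevant unseen ones."""
    fresh = [d for d in ranked_docs if d["id"] not in seen_ids]
    return fresh[:k]

def contextualize(query, docs):
    """Contextualization module (sketch): keep only the sentences of
    each document that share terms with the query. A real system would
    replace this trivial filter with an LLM call."""
    terms = set(query.lower().split())
    condensed = []
    for d in docs:
        kept = [s for s in d["text"].split(". ")
                if terms & set(s.lower().split())]
        condensed.append({"id": d["id"], "text": " ".join(kept)})
    return condensed

def agentic_rag_turn(query, ranked_docs, seen_ids):
    """One retrieval turn: drop duplicates, record what was shown,
    and condense the surviving documents for the generation prompt."""
    docs = deduplicate(ranked_docs, seen_ids)
    seen_ids.update(d["id"] for d in docs)
    return contextualize(query, docs)
```

In an iterative pipeline, `seen_ids` would persist across turns so that each retrieval call surfaces new evidence, which is the mechanism the abstract credits for the reduction in average turn count.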