Test-Time Strategies for More Efficient and Accurate Agentic RAG
March 12, 2026
作者: Brian Zhang, Deepti Guntur, Zhiyang Zuo, Abhinav Sharma, Shreyas Chaudhari, Wenlong Zhao, Franck Dernoncourt, Puneet Mathur, Ryan Rossi, Nedim Lipka
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) systems struggle with complex, multi-hop questions, and iterative agentic frameworks such as Search-R1 (Jin et al., 2025) have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repeated retrieval of previously processed information and difficulty contextualizing retrieved results effectively within the current generation prompt. These issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption.
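To make the failure mode concrete, the control flow of an iterative agentic RAG loop can be sketched as below. This is an illustrative sketch, not the Search-R1 implementation: `llm` and `retriever` are caller-supplied callables, and the dict-based action protocol is an assumption for the example.

```python
def agentic_rag(question, llm, retriever, max_turns=4):
    """Minimal iterative retrieve-then-reason loop in the agentic RAG style.

    `llm(question, context)` is assumed to return either
    {"action": "search", "text": query} or {"action": "answer", "text": answer};
    `retriever(query)` returns a list of document strings. Hypothetical names.
    """
    context = []
    for _ in range(max_turns):
        step = llm(question, context)      # emit a search query or a final answer
        if step["action"] == "answer":
            return step["text"], len(context)
        docs = retriever(step["text"])     # note: nothing prevents these docs
        context.extend(docs)               # from duplicating earlier turns' results
    # turn budget exhausted: force a final answer from the accumulated context
    return llm(question, context)["text"], len(context)
```

The two commented lines are the source of the inefficiencies the paper targets: repeated retrieval inflates `context` (token cost) and can add extra turns without adding information.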
In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these identified shortcomings. Specifically, we explore the integration of two components, individually and in combination: a contextualization module that better integrates relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches on the HotpotQA (Yang et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of retrieval turns.
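The de-duplication idea described above can be sketched as a small filter over the retriever's ranked output: documents seen in earlier turns are skipped, and the next most relevant unseen documents take their place. The function name, signature, and ID-based bookkeeping are assumptions for illustration, not details from the paper.

```python
def deduplicate(ranked_docs, seen_ids, k):
    """Select the top-k documents, replacing already-seen ones with the
    next most relevant unseen documents.

    ranked_docs: relevance-ordered list of (doc_id, text) pairs from the retriever.
    seen_ids:    mutable set of IDs retrieved in earlier turns (updated in place).
    """
    # keep only documents not retrieved in a previous turn, preserving rank order
    fresh = [(doc_id, text) for doc_id, text in ranked_docs if doc_id not in seen_ids]
    selected = fresh[:k]
    # remember what we hand back so later turns skip it too
    seen_ids.update(doc_id for doc_id, _ in selected)
    return selected
```

Because the filter preserves the retriever's ranking, each turn still receives the most relevant documents available, just not ones the agent has already processed.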
Our best-performing variant, utilizing GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
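For readers unfamiliar with the EM metric reported above, open-domain QA work conventionally computes it with SQuAD-style answer normalization before string comparison; the paper does not spell out its exact scoring script, so the following is a sketch of the standard recipe, not necessarily the authors' code.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))
```

Under this metric an answer like "The Eiffel Tower." still matches the gold string "eiffel tower", so the 5.6% EM gain reflects genuinely different answer content, not surface formatting.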