
s3: You Don't Need That Much Data to Train a Search Agent via RL

May 20, 2025
Authors: Pengcheng Jiang, Xueqiang Xu, Jiacheng Lin, Jinfeng Xiao, Zifeng Wang, Jimeng Sun, Jiawei Han
cs.AI

Abstract

Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
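To make the reward concrete, here is a minimal Python sketch of the Gain Beyond RAG idea as described in the abstract: the searcher's reward for a query is the frozen generator's accuracy gain when it answers from the searcher's documents rather than from documents retrieved by naive RAG. All names and signatures below (gain_beyond_rag, the Retriever/Generator/Scorer callables) are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Illustrative type aliases; these are assumptions, not the s3 API.
Retriever = Callable[[str], List[str]]       # query -> retrieved documents
Generator = Callable[[str, List[str]], str]  # (query, docs) -> answer
Scorer = Callable[[str, str], float]         # (answer, gold) -> accuracy in [0, 1]

def gain_beyond_rag(
    question: str,
    gold_answer: str,
    naive_retrieve: Retriever,     # baseline RAG: single-shot retrieval on the raw query
    searcher_retrieve: Retriever,  # RL-trained searcher, possibly multi-turn
    generate: Generator,           # frozen (or proprietary) generator LLM
    score: Scorer,                 # e.g., exact match against the gold answer
) -> float:
    """Reward for one query: generation accuracy with the searcher's
    documents minus generation accuracy with naive-RAG documents."""
    baseline = score(generate(question, naive_retrieve(question)), gold_answer)
    searched = score(generate(question, searcher_retrieve(question)), gold_answer)
    return searched - baseline
```

Because only the searcher's policy is optimized against this reward, the generator can stay frozen, which is the decoupling the abstract emphasizes as enabling compatibility with proprietary models.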
