s3: You Don't Need That Much Data to Train a Search Agent via RL

Abstract

Retrieval-augmented generation (RAG) systems let large language models (LLMs) access external knowledge during inference. Recent advances have enabled LLMs to act as search agents trained with reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches fall into one of two camps. Some optimize retrieval with search-only metrics (e.g., NDCG) that ignore downstream utility. Others fine-tune the entire LLM to jointly reason and retrieve, which entangles retrieval with generation and limits both real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
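
As a rough illustration of the Gain Beyond RAG reward described above, the sketch below scores a searcher by how much it improves answer accuracy over a naive RAG baseline for the same question. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `exact_match` and `gain_beyond_rag` are hypothetical, and normalized exact match stands in for whatever generation-accuracy metric the paper actually uses.

```python
from typing import Callable, Sequence


def exact_match(prediction: str, gold_answers: Sequence[str]) -> float:
    """Return 1.0 if the prediction equals any gold answer after simple
    normalization (lowercasing, whitespace collapsing), else 0.0.
    Assumption: a stand-in for the paper's generation-accuracy metric."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    target = normalize(prediction)
    return 1.0 if any(target == normalize(g) for g in gold_answers) else 0.0


def gain_beyond_rag(
    answer_with_searcher: str,
    answer_with_naive_rag: str,
    gold_answers: Sequence[str],
    accuracy: Callable[[str, Sequence[str]], float] = exact_match,
) -> float:
    """Reward = accuracy of the frozen generator's answer when given the
    searcher's documents, minus its accuracy when given naive RAG documents."""
    return accuracy(answer_with_searcher, gold_answers) - accuracy(
        answer_with_naive_rag, gold_answers
    )


# Hypothetical example: the searcher's context leads to the right answer,
# while the naive RAG context does not, so the reward is positive.
reward = gain_beyond_rag("Paris", "Lyon", ["paris"])
print(reward)
```

Because both answers come from the same frozen generator, a reward of this shape credits only the change in retrieved context, which is consistent with the abstract's claim that the searcher is trained separately from the generator.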
