R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems. That knowledge can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong retrieval-augmented generation (RAG) methods, even when compared to the closed-source GPT-4o-mini.
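
The idea of a model invoking search during reasoning, scored only on its final outcome, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `<query>`, `<documents>`, and `<answer>` tag names, the `generate` and `search` callables, the `max_calls` limit, and the exact-match reward are all hypothetical stand-ins.

```python
import re
from typing import Callable

# Hypothetical tag formats; the paper's actual markers may differ.
QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def reason_with_search(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], str],
    max_calls: int = 4,
) -> str:
    """Alternate generation and retrieval until the model stops issuing queries."""
    context = question
    for _ in range(max_calls):
        step = generate(context)
        context += step
        match = QUERY_RE.search(step)
        if match is None:
            # No query was emitted, so the model is done reasoning.
            return context
        # Feed the retrieved text back so the next step can use it.
        docs = search(match.group(1).strip())
        context += f"<documents>{docs}</documents>"
    return context


def outcome_reward(transcript: str, gold: str) -> float:
    """Score only the final answer; intermediate steps earn nothing."""
    match = ANSWER_RE.search(transcript)
    if match is None:
        return 0.0
    # Exact match is an assumed, simplified scoring rule.
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


# Toy stand-ins for the policy model and the retriever (assumptions).
def toy_generate(context: str) -> str:
    if "<documents>" not in context:
        return "<query>capital of France</query>"
    return "<answer>Paris</answer>"


def toy_search(query: str) -> str:
    return "Paris is the capital of France."


transcript = reason_with_search(
    "Q: What is the capital of France?\n", toy_generate, toy_search
)
reward = outcome_reward(transcript, "paris")
```

In this sketch the reward depends only on the text inside the final answer tag, which is one simple way to read "outcome-based": no intermediate step is scored, so there is no process reward to design.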
