ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Abstract

Large Language Models (LLMs) have shown remarkable reasoning capabilities, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions that require multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning, without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain: the decision of when and how to search is guided by text-based thinking, and the search results in turn influence subsequent reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
