SSRL: Self-Search Reinforcement Learning

Abstract

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
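The pass@k metric mentioned above can be estimated from repeated sampling. The abstract does not say which estimator the authors use, so the following is only a minimal sketch assuming the commonly used unbiased combinatorial estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per question and c is the number judged correct. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for one question
    c: number of those samples judged correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs, one per question."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Evaluating `mean_pass_at_k` over a range of k values for a fixed set of samples gives one way to plot how accuracy scales with the inference budget, which is the kind of scaling behavior the abstract describes.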
