Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on carefully crafted task queries and corresponding ground-truth answers to provide accurate rewards. Producing these requires substantial human effort and hinders RL scaling, especially in agentic scenarios. Although a few recent works explore task synthesis, the difficulty of the generated agentic tasks is hard to control, so these methods fall short of providing an effective RL training advantage. To make agentic RLVR more scalable, we explore self-play training for deep search agents, in which the learning LLM uses multi-turn search engine calls and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries that have well-defined ground-truth answers and increasing difficulty. The problem solver tries to handle the generated search queries and output correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all search results from the proposer's trajectory as external knowledge, then run retrieval-augmented generation (RAG) to test whether the proposed query can be answered correctly when all necessary search documents are provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. Extensive experiments show that SSP significantly and consistently improves search agents' performance across various benchmarks, without any supervision, under both from-scratch and continual RL training setups. The code is available at https://github.com/Alibaba-Quark/SSP.
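The control flow of one SSP round can be sketched as below. This is a minimal illustration under stated assumptions, not the released implementation: the names `Proposal`, `ssp_round`, `propose`, `rag_answer`, and `solve` are hypothetical, answers are checked by normalized exact match, and the proposer reward of one minus the solver's success rate is an assumed stand-in for the competitive signal, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Proposal:
    """A proposer output (hypothetical structure, not from the paper's code)."""
    query: str            # generated deep search query
    answer: str           # ground-truth answer claimed by the proposer
    documents: List[str]  # all search results collected along the proposer's trajectory


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, truth: str) -> bool:
    # Assumption: normalized exact match stands in for the verifiable reward.
    return normalize(prediction) == normalize(truth)


def ssp_round(
    propose: Callable[[], Proposal],
    rag_answer: Callable[[str, List[str]], str],
    solve: Callable[[str], str],
    n_attempts: int = 4,
) -> Dict[str, object]:
    """Run one proposer/solver round and return illustrative rewards."""
    proposal = propose()

    # RAG check: with every collected document provided, the query must be
    # answerable with the proposer's stated answer; otherwise it is discarded.
    rag_prediction = rag_answer(proposal.query, proposal.documents)
    if not is_correct(rag_prediction, proposal.answer):
        return {"valid": False, "solver_rewards": [], "proposer_reward": 0.0}

    # The solver answers without the proposer's documents (it would search itself).
    solver_rewards = [
        1.0 if is_correct(solve(proposal.query), proposal.answer) else 0.0
        for _ in range(n_attempts)
    ]
    solve_rate = sum(solver_rewards) / n_attempts

    # Assumed competitive signal: harder (but still valid) queries pay more.
    return {
        "valid": True,
        "solver_rewards": solver_rewards,
        "proposer_reward": 1.0 - solve_rate,
    }


if __name__ == "__main__":
    # Stub policies; in practice all three roles would be played by the same LLM.
    docs = ["The 1900 Summer Olympics were held in Paris, France."]
    stub_proposal = Proposal("Which city hosted the 1900 Summer Olympics?", "Paris", docs)
    stub_answers = iter(["Paris", "London", "paris", "Rome"])

    result = ssp_round(
        propose=lambda: stub_proposal,
        rag_answer=lambda query, documents: "Paris",
        solve=lambda query: next(stub_answers),
    )
    print(result)
```

Putting the RAG check before the solver attempts means that a query which cannot be answered from the proposer's own evidence earns the proposer nothing, so it cannot raise difficulty simply by emitting unanswerable or mislabeled questions.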