Search Self-play: Pushing the Frontier of Agent Capability without Supervision
October 21, 2025
Authors: Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become the
mainstream technique for training LLM agents. However, RLVR depends heavily on
well-crafted task queries and corresponding ground-truth answers to provide
accurate rewards, which requires massive human effort and hinders the scaling
of RL, especially in agentic scenarios. Although a few recent works explore
task synthesis methods, the difficulty of the generated agentic tasks can
hardly be controlled to provide an effective RL training signal. To
achieve agentic RLVR with higher scalability, we explore self-play training for
deep search agents, in which the learning LLM utilizes multi-turn search engine
calling and acts simultaneously as both a task proposer and a problem solver.
The task proposer aims to generate deep search queries with well-defined
ground-truth answers and increasing task difficulty. The problem solver tries
to handle the generated search queries and to output correct answer
predictions. To ensure that each generated search query has an accurate ground
truth, we collect all the search results from the proposer's trajectory as
external knowledge, then conduct retrieval-augmented generation (RAG) to
test whether the proposed query can be correctly answered with all necessary
search documents provided. In this search self-play (SSP) game, the proposer
and the solver co-evolve their agent capabilities through both competition and
cooperation. Through extensive experiments, we find that SSP significantly and
uniformly improves search agents' performance on various benchmarks without
any supervision, under both from-scratch and continuous RL training setups.
The code is available at https://github.com/Alibaba-Quark/SSP.
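To make the verification step concrete, here is a minimal Python sketch of the RAG-based ground-truth check described in the abstract. The prompt format, the exact-match criterion, and the `llm_generate` callable are illustrative assumptions, not the repository's actual API.

```python
def em_score(pred: str, gold: str) -> float:
    """Normalized exact match between a prediction and a gold answer."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))


def verify_query(query: str, gold: str, proposer_docs: list[str],
                 llm_generate) -> bool:
    """Accept a proposed query only if it is answerable via RAG.

    proposer_docs holds every search result collected along the
    proposer's trajectory, supplied as external knowledge so the
    checker never needs to call the search engine again.
    llm_generate is an assumed callable: prompt (str) -> answer (str).
    """
    context = "\n\n".join(proposer_docs)
    prompt = (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    rag_answer = llm_generate(prompt)
    # Keep the query only when the RAG answer matches the claimed
    # ground truth (exact match here; softer metrics also work).
    return em_score(rag_answer, gold) == 1.0
```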
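And a minimal sketch of how one SSP round could wire the two roles together, reusing `em_score` and `verify_query` from above. The injected callables (`propose`, `solve`, `update`) are hypothetical stand-ins for the paper's multi-turn search rollouts and RL updates, and the zero-sum reward scheme is one plausible reading of the described competition, not the repository's implementation.

```python
def ssp_round(propose, solve, update, verify):
    """One round of the search self-play (SSP) game.

    propose() -> (query, gold, docs, proposer_traj): the proposer's
        multi-turn search rollout, returning the generated query, its
        claimed ground-truth answer, all retrieved documents, and the
        trajectory itself.
    solve(query) -> (prediction, solver_traj): a fresh multi-turn
        search rollout by the solver on the proposed query.
    update(traj, reward): an assumed RL policy update.
    verify(query, gold, docs) -> bool: e.g. verify_query above with
        llm_generate bound via functools.partial.
    """
    query, gold, docs, proposer_traj = propose()

    # Discard queries whose ground truth cannot be recovered from the
    # proposer's own search results: no verified answer, no training.
    if not verify(query, gold, docs):
        return

    prediction, solver_traj = solve(query)
    solved = em_score(prediction, gold) == 1.0

    # Adversarial rewards: the solver is rewarded for answering, the
    # proposer for verified queries the solver misses, pushing query
    # difficulty up while the shared policy learns both roles.
    update(solver_traj, reward=float(solved))
    update(proposer_traj, reward=float(not solved))
```

Under this reading, the proposer only scores when its query both passes verification and defeats the solver, which is what would drive task difficulty upward as the solver improves.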