Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires substantial human effort and hinders RL scaling, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of the generated agentic tasks can hardly be controlled well enough to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM uses multi-turn search engine calls and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all the search results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be answered correctly when all necessary search documents are provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP significantly and uniformly improves search agents' performance on various benchmarks without any supervision, under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
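
The proposer, RAG check, and solver steps described in the abstract can be illustrated with a minimal sketch of one self-play round. This is not the authors' implementation: the names `Proposal`, `ssp_round`, and `exact_match` are hypothetical, the three callables stand in for LLM rollouts, and the zero-sum reward (proposer rewarded when the solver fails) is an assumption made only for illustration, since the abstract does not specify the reward design.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Proposal:
    """A proposed task (hypothetical container, not from the paper's code)."""
    question: str
    answer: str
    documents: List[str]  # search results gathered along the proposer's trajectory


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed correctness check: case-insensitive exact string match."""
    return prediction.strip().lower() == reference.strip().lower()


def ssp_round(
    propose: Callable[[], Proposal],
    rag_answer: Callable[[str, List[str]], str],
    solve: Callable[[str], str],
    is_correct: Callable[[str, str], bool] = exact_match,
) -> Optional[Tuple[float, float]]:
    """Run one round; return (proposer_reward, solver_reward), or None if rejected."""
    proposal = propose()

    # RAG check: with every retrieved document supplied, the question must be
    # answerable; otherwise the proposal is discarded as unverifiable.
    rag_prediction = rag_answer(proposal.question, proposal.documents)
    if not is_correct(rag_prediction, proposal.answer):
        return None

    # The solver sees only the question and must search on its own.
    prediction = solve(proposal.question)
    solver_reward = 1.0 if is_correct(prediction, proposal.answer) else 0.0

    # ASSUMPTION: zero-sum reward, so the proposer gains when the solver fails.
    proposer_reward = 1.0 - solver_reward
    return proposer_reward, solver_reward
```

In this sketch a rejected proposal yields no reward for either role, which keeps unverifiable questions out of the solver's training signal. How rejected proposals are actually penalized or filtered is left open here.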
