Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Abstract

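Recent advances in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among the many tools available, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of expert-level Search Intelligence: the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches are limited in scalability, efficiency, and data quality. For example, the small turn limits used in existing online reinforcement learning (RL) methods (e.g., <=10) restrict the learning of complex strategies. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions are: (1) scalable, fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency; and (2) a prompt-based LLM agent that autonomously synthesizes high-quality, challenging QA pairs to build a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with Avg@4 gains of 46.7% on xBench and 20.8% on GAIA. Notably, our agent exhibits extreme long-horizon search during training, with tool calls exceeding 40 turns and output tokens exceeding 150k. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.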

