WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Abstract

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify the scarcity of challenging information-seeking data as the key obstacle. To address this limitation, we introduce WebExplorer, a systematic data generation approach that uses model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. Using our curated high-quality dataset, we develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B searches effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also generalizes strongly to the HLE benchmark, even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
