QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

Abstract

Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier systems remain proprietary, while existing open agents often generalize poorly across task types, leaving it unclear how to train a broadly capable deep research agent. We release QUEST, a family of open models (ranging from 2B to 35B) that serve as general-purpose deep research agents designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. To build QUEST, we propose an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. Central to this recipe is a curated data synthesis pipeline based on unified rubric trees, which applies to different task types and makes it possible to synthesize training data with verifiable rewards without human annotation. In addition, QUEST incorporates a built-in context management mechanism that enables effective long-horizon reasoning and knowledge synthesis. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents. We release everything: models, data, and training scripts.