Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Abstract

Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA data synthesis: we introduce verification mechanisms into graph-based and agent-based QA synthesis to control question difficulty while ensuring that answers are unique and correct; (2) Trajectory construction: we design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: we use Marco DeepResearch itself as a verifier at inference time, which effectively improves performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on the most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, such as Tongyi DeepResearch-30B.
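The third level, using the agent itself as a verifier at inference time, can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: the abstract does not specify how candidates are generated or accepted, so the retry loop, the `attempt` and `verify` callables, and the `max_rounds` cap are all hypothetical. Only the 600 tool-call budget comes from the text.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   attempt(question) -> (answer, tool_calls_used)
#   verify(question, answer) -> (accepted, tool_calls_used)
Attempt = Callable[[str], Tuple[str, int]]
Verify = Callable[[str, str], Tuple[bool, int]]


def answer_with_self_verification(
    question: str,
    attempt: Attempt,
    verify: Verify,
    max_tool_calls: int = 600,  # budget mentioned in the abstract
    max_rounds: int = 8,        # hypothetical cap so the loop always ends
) -> Tuple[str, int]:
    """Retry until the verifier accepts an answer or the budget runs out.

    Returns the chosen answer and the number of tool calls spent.
    The budget is checked between rounds, so the final round may overshoot.
    """
    used = 0
    candidates: List[str] = []
    for _ in range(max_rounds):
        if used >= max_tool_calls:
            break
        answer, cost = attempt(question)
        used += cost
        candidates.append(answer)
        accepted, cost = verify(question, answer)
        used += cost
        if accepted:
            return answer, used
    # No candidate was accepted: fall back to the most recent one.
    return (candidates[-1] if candidates else ""), used
```

In this sketch the same model would sit behind both callables, once prompted to answer and once prompted to check an answer. Spending part of the tool-call budget on checking is the trade-off that makes the verification step a form of test-time scaling.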
