DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an essential component for developing advanced reasoning skills in large language models (LLMs), yet recent studies have documented training plateaus that emerge after thousands of optimization steps, with performance gains shrinking markedly despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to cover the solution space systematically. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that use tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state of the art for 1.5B reasoning models while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
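The abstract names a global frontier selection strategy but does not spell out how nodes are scored or stored. As a rough illustration of the general idea only, the sketch below keeps every unexpanded leaf of a search tree in a single priority queue, so the next node to expand is chosen across all branches and depths instead of by a root-to-leaf descent. The `Node` structure, the `GlobalFrontier` class, the `expand` helper, and the numeric scores are all hypothetical and are not taken from the paper.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One partial reasoning trace in the search tree (hypothetical structure)."""
    text: str
    score: float  # assumed priority; the abstract does not define the scoring rule
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class GlobalFrontier:
    """Holds every unexpanded leaf in one max-priority queue."""

    def __init__(self) -> None:
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so Node objects are never compared

    def push(self, node: Node) -> None:
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-node.score, next(self._tie), node))

    def pop_best(self) -> Node:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def expand(node: Node, candidates) -> List[Node]:
    """Attach (text, score) pairs as children; a real system would sample them from the policy."""
    for text, score in candidates:
        node.children.append(Node(text=text, score=score, parent=node))
    return node.children


root = Node("problem", score=0.0)
frontier = GlobalFrontier()
for child in expand(root, [("step A", 0.4), ("step B", 0.7)]):
    frontier.push(child)

best = frontier.pop_best()  # highest-scoring leaf so far
for child in expand(best, [("step B1", 0.3), ("step B2", 0.2)]):
    frontier.push(child)

# The next pick compares leaves from different branches and depths at once.
next_best = frontier.pop_best()
print(best.text, "->", next_best.text)
```

In this toy run the second pick returns to the sibling branch because its leaf outranks both new children, which is the behavior a tree-wide frontier allows and a purely local descent would not.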
