DeepPrune: Parallel Scaling without Inter-trace Redundancy

Abstract

Parallel scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach is computationally inefficient because of inter-trace redundancy: our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, which represents substantial wasted computation. To address this efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling techniques, that predicts answer equivalence from partial reasoning traces and achieves 0.87 AUROC on equivalence prediction. The judge is combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces token usage by over 80% compared to conventional consensus sampling in most cases, while keeping accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning. Our code and data are available at https://deepprune.github.io/
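The two components named above can be illustrated with a minimal sketch. The abstract does not give the implementation details, so everything below is an assumption made for illustration: the standard binary focal loss form with illustrative `gamma` and `alpha` values, a hypothetical `judge` callable that returns the probability that two partial traces will reach the same final answer, a fixed `threshold` of 0.5, and the choice of a cluster's first member as its representative.

```python
import math
from typing import Callable, List


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one example.

    p is the predicted probability of the positive class and y is the 0/1 label.
    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def greedy_cluster(
    prefixes: List[str],
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Assign each partial trace, in arrival order, to the first cluster whose
    representative the judge considers answer-equivalent; otherwise open a new
    cluster. Returns lists of trace indices.
    """
    clusters: List[List[int]] = []
    for i, prefix in enumerate(prefixes):
        for cluster in clusters:
            representative = prefixes[cluster[0]]  # assumption: first member
            if judge(representative, prefix) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so this trace may add a new answer.
            clusters.append([i])
    return clusters


def toy_judge(a: str, b: str) -> float:
    """Stand-in for the trained judge: compares the last token of each prefix."""
    return 1.0 if a.split()[-1] == b.split()[-1] else 0.0


prefixes = ["... so x = 4", "... hence x = 4", "... so x = 7"]
clusters = greedy_cluster(prefixes, toy_judge)
kept = [cluster[0] for cluster in clusters]  # one trace continues per cluster
```

In this toy run the first two prefixes fall into one cluster and the third opens its own, so only traces 0 and 2 would continue generating. Under these assumptions, redundant traces are pruned while distinct candidate answers are kept.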
