PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

Abstract

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference. This creates a population-enhancement bottleneck in which deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns on additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
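The score-guided resampling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each candidate receives a single scalar PRM score, that the energy is the negative score, and that resampling uses Boltzmann weights at a fixed temperature. The names `boltzmann_weights`, `resample`, `refine_round`, `score_fn`, and `refine_fn` are hypothetical and chosen only for illustration.

```python
import math
import random


def boltzmann_weights(scores, temperature=1.0):
    """Map scores to normalized weights, assuming energy = -score.

    Subtracting the maximum score before exponentiating keeps the
    computation numerically stable without changing the result.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    unnormalized = [math.exp((s - top) / temperature) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def resample(population, scores, temperature=1.0, rng=None):
    """Draw a same-size population with replacement, favoring high scores."""
    if rng is None:
        rng = random.Random()
    weights = boltzmann_weights(scores, temperature)
    return rng.choices(population, weights=weights, k=len(population))


def refine_round(population, score_fn, refine_fn, temperature=1.0, rng=None):
    """One round: score every candidate, resample, then perturb survivors.

    score_fn stands in for a PRM call and refine_fn for a stochastic
    rewrite of a candidate; both are placeholders (assumptions).
    """
    if rng is None:
        rng = random.Random()
    scores = [score_fn(candidate) for candidate in population]
    survivors = resample(population, scores, temperature, rng)
    return [refine_fn(candidate, rng) for candidate in survivors]
```

In this sketch a lower temperature concentrates the population on the highest-scoring candidates, while a higher temperature keeps more low-scoring candidates alive, which is one simple way to trade off concentration against diversity.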
