Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Abstract

We present Klear-Reasoner, a model with long reasoning capabilities that deliberates carefully during problem solving and achieves outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult because training details are often incompletely disclosed. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources is more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving Clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6.
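
The abstract does not spell out the GPPO objective, so the following is only an illustrative sketch of what "gently backpropagating gradients from clipped tokens" could look like. It assumes a PPO-style clipped surrogate and assumes that the clipped branch is rescaled with a stop-gradient so its value stays at the clip bound while its gradient is kept. The function name `grad_weight`, the parameters `eps_low` and `eps_high`, and the default value 0.2 are hypothetical and are not taken from the report. The function returns the per-token coefficient that multiplies the gradient of the log-probability.

```python
def grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2, preserve=True):
    """Per-token coefficient on grad(log pi) for a clipped surrogate objective.

    Assumption (not from the source): the objective is
    min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A), and the
    gradient-preserving variant replaces the clipped value `bound` with
    bound / stop_gradient(ratio) * ratio, which has the same forward value
    but a non-zero gradient.
    """
    upper = 1.0 + eps_high
    lower = 1.0 - eps_low

    # Positive advantage with the ratio above the upper bound: standard
    # clipping gives a zero gradient; the preserving variant keeps a bounded one.
    if advantage > 0 and ratio > upper:
        return upper * advantage if preserve else 0.0

    # Negative advantage with the ratio below the lower bound: same idea,
    # so the negative sample still contributes a learning signal.
    if advantage < 0 and ratio < lower:
        return lower * advantage if preserve else 0.0

    # Everywhere else the unclipped term is active, since
    # d(ratio * A)/d(theta) = ratio * A * grad(log pi).
    return ratio * advantage


if __name__ == "__main__":
    for r, a in [(1.5, 1.0), (0.5, -1.0), (1.1, 1.0)]:
        print(r, a, grad_weight(r, a, preserve=False), grad_weight(r, a))
```

Under these assumptions, a token whose ratio has left the clip range still receives a gradient, but its magnitude is capped by the clip bound rather than growing with the ratio. This is one reading of how exploration signals and negative samples could keep contributing without destabilizing the update.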
