Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Abstract

Large language models trained with reinforcement learning with verifiable rewards tend to trade length for accuracy, inflating response lengths to achieve accuracy gains. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering the responses to train on by two key metrics: (1) response length and (2) token efficiency (the ratio of reward to tokens). By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases the reduction in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy, especially on difficult questions. GFPO demonstrates that increased training-time compute translates directly into reduced test-time compute: a simple yet effective trade-off for efficient reasoning.
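The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract only states that a larger group is sampled and then filtered by response length or by reward per token; everything else here is an assumption made for illustration, including the function name `gfpo_advantages`, the parameter `k` (number of responses kept), the choice to normalize rewards over the kept subset only, and the choice to give filtered-out responses an advantage of zero.

```python
from statistics import mean, pstdev


def gfpo_advantages(rewards, lengths, k, metric="length", eps=1e-6):
    """Hypothetical sketch: keep k of the sampled responses, zero out the rest.

    rewards: one scalar reward per sampled response.
    lengths: one token count per sampled response (assumed positive).
    metric:  "length" keeps the shortest responses;
             "token_efficiency" keeps the highest reward-per-token responses.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same size")
    if not 1 <= k <= len(rewards):
        raise ValueError("k must be between 1 and the group size")

    if metric == "length":
        def score(i):
            return lengths[i]  # smaller is better
    elif metric == "token_efficiency":
        def score(i):
            return -rewards[i] / lengths[i]  # larger reward per token is better
    else:
        raise ValueError(f"unknown metric: {metric}")

    # Rank the whole group and keep the k best responses under the metric.
    kept = sorted(range(len(rewards)), key=score)[:k]

    # Assumption: normalize rewards using statistics of the kept subset only.
    kept_rewards = [rewards[i] for i in kept]
    mu = mean(kept_rewards)
    sigma = pstdev(kept_rewards)

    # Assumption: responses that were filtered out contribute no gradient signal.
    advantages = [0.0] * len(rewards)
    for i in kept:
        advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]
    lengths = [100, 50, 400, 300]
    print(gfpo_advantages(rewards, lengths, k=2, metric="length"))
    print(gfpo_advantages(rewards, lengths, k=2, metric="token_efficiency"))
```

In this toy group, filtering by length keeps the two shortest responses regardless of correctness, while filtering by token efficiency keeps the two correct responses. Because both of those have the same reward, their normalized advantages collapse to zero, which hints at why the group needs to be large enough for the kept subset to retain some reward variance.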
