Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
August 13, 2025
Authors: Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, Dimitris Papailiopoulos
cs.AI
Abstract
Large language models trained with reinforcement learning with verifiable rewards tend to trade length for accuracy, inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length inflation by sampling larger groups per problem during training and filtering the responses to train on based on two key metrics: (1) response length and (2) token efficiency, the reward-per-token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases the reduction in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy, especially on difficult questions. GFPO demonstrates that increased training-time compute translates directly into reduced test-time compute, a simple yet effective trade-off for efficient reasoning.
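
The abstract describes GFPO's mechanism only in prose; the sketch below illustrates what the group-filtering step might look like in Python. It is a minimal sketch under assumptions, not the paper's implementation: the function names gfpo_filter and adaptive_k, the retention bounds k_min/k_max, and the toy rewards are all hypothetical.

import numpy as np

def gfpo_filter(responses, rewards, k, metric="token_efficiency"):
    """Keep the k responses from a larger sampled group that score best on
    the chosen metric; in GFPO, only retained responses contribute to the
    policy update.

    responses: token-id sequences sampled for one problem
    rewards:   verifiable reward per response (e.g., 1.0 if correct, else 0.0)
    k:         number of responses to retain for training
    metric:    "length" (prefer shorter) or "token_efficiency"
               (prefer higher reward per token)
    """
    lengths = np.array([len(r) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if metric == "length":
        scores = -lengths              # shorter responses rank higher
    elif metric == "token_efficiency":
        scores = rewards / lengths     # reward-per-token ratio
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(scores)[-k:]     # indices of the top-k responses

def adaptive_k(rewards, k_min=2, k_max=8):
    """Adaptive Difficulty GFPO, sketched: retain more responses for harder
    problems. Difficulty is estimated online as the group's failure rate;
    k_min and k_max are hypothetical bounds, not values from the paper."""
    difficulty = 1.0 - float(np.mean(rewards))   # fraction of incorrect samples
    return int(round(k_min + difficulty * (k_max - k_min)))

# Toy usage: 8 sampled responses, retain the most token-efficient ones.
group = [list(range(n)) for n in (120, 300, 90, 500, 250, 80, 400, 150)]
rewards = [1, 1, 0, 1, 1, 1, 0, 1]
print(gfpo_filter(group, rewards, k=adaptive_k(rewards)))

In this reading, "sampling more to think less" corresponds to drawing a large group, discarding long or token-inefficient responses before the update, and so biasing the policy toward concise solutions without changing the reward itself.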