Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
November 2, 2025
Authors: Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) match baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, roughly half as long. The code is available on GitHub at https://github.com/MBZUAI-Paris/Frugal-AI, with datasets and models on Hugging Face at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc.
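
To make the data-curation idea concrete, the sketch below shows one way such difficulty-aware sampling could look in an RLVR pipeline: moderately easy problems (those the current policy already solves fairly often) are retained and modestly up-weighted rather than filtered out, so every batch still contains short-chain solvable tasks. This is a minimal illustration, not the released implementation; the Problem class, the pass-rate band, and the boost factor are hypothetical placeholders.

```python
"""Illustrative sketch of difficulty-aware sampling for RLVR training.

Assumptions (not from the paper's code): each problem carries an estimated
pass rate of the current policy, the "moderately easy" band is 0.6-0.9,
and such problems receive a 1.5x sampling boost.
"""
from dataclasses import dataclass
import random


@dataclass
class Problem:
    prompt: str
    pass_rate: float  # estimated solve rate of the current policy, in [0, 1]


def sampling_weight(p: Problem,
                    easy_band: tuple[float, float] = (0.6, 0.9),
                    easy_boost: float = 1.5) -> float:
    """Assign a sampling weight by difficulty.

    - Trivial problems (pass_rate ~ 1.0) carry no learning signal: drop them.
    - Moderately easy problems (inside easy_band) are kept and modestly
      up-weighted; they act as the implicit length regularizer.
    - Harder problems keep unit weight.
    """
    if p.pass_rate >= 0.99:      # essentially always solved: no gradient signal
        return 0.0
    lo, hi = easy_band
    if lo <= p.pass_rate <= hi:  # moderately easy: retain and up-weight
        return easy_boost
    return 1.0                   # hard problems: default weight


def sample_batch(pool: list[Problem], batch_size: int, seed: int = 0) -> list[Problem]:
    """Draw an RLVR training batch according to the difficulty-aware weights."""
    rng = random.Random(seed)
    weighted = [(p, sampling_weight(p)) for p in pool]
    kept = [(p, w) for p, w in weighted if w > 0.0]
    return rng.choices([p for p, _ in kept],
                       weights=[w for _, w in kept],
                       k=batch_size)


if __name__ == "__main__":
    pool = [
        Problem("trivial arithmetic", 1.00),
        Problem("moderately easy olympiad warm-up", 0.75),
        Problem("hard AIME-style problem", 0.15),
    ]
    batch = sample_batch(pool, batch_size=8)
    print([p.prompt for p in batch])
```

Because the moderately easy problems admit short correct solutions, their presence in every batch keeps the reward-maximizing length distribution anchored, which is the mechanism the abstract describes; no explicit length penalty appears anywhere in the sketch.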