Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
November 2, 2025
Authors: Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) match baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, roughly half as long. The code is available on GitHub at https://github.com/MBZUAI-Paris/Frugal-AI, with datasets and models on Hugging Face at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc.
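
To make the data-curation idea concrete, the sketch below shows one way such difficulty-aware sampling could look in an RLVR pipeline: moderately easy problems (those the current policy already solves fairly often) are retained and modestly up-weighted rather than filtered out, so every batch still contains short-chain solvable tasks. This is a minimal illustration, not the released implementation; the Problem class, the pass-rate band, and the boost factor are hypothetical placeholders.

```python
"""Illustrative sketch of difficulty-aware sampling for RLVR training.

Assumptions (not from the paper's code): each problem carries an estimated
pass rate of the current policy, the "moderately easy" band is 0.6-0.9,
and such problems receive a 1.5x sampling boost.
"""
from dataclasses import dataclass
import random


@dataclass
class Problem:
    prompt: str
    pass_rate: float  # estimated solve rate of the current policy, in [0, 1]


def sampling_weight(p: Problem,
                    easy_band: tuple[float, float] = (0.6, 0.9),
                    easy_boost: float = 1.5) -> float:
    """Assign a sampling weight by difficulty.

    - Trivial problems (pass_rate ~ 1.0) carry no learning signal: drop them.
    - Moderately easy problems (inside easy_band) are kept and modestly
      up-weighted; they act as the implicit length regularizer.
    - Harder problems keep unit weight.
    """
    if p.pass_rate >= 0.99:      # essentially always solved: no gradient signal
        return 0.0
    lo, hi = easy_band
    if lo <= p.pass_rate <= hi:  # moderately easy: retain and up-weight
        return easy_boost
    return 1.0                   # hard problems: default weight


def sample_batch(pool: list[Problem], batch_size: int, seed: int = 0) -> list[Problem]:
    """Draw an RLVR training batch according to the difficulty-aware weights."""
    rng = random.Random(seed)
    weighted = [(p, sampling_weight(p)) for p in pool]
    kept = [(p, w) for p, w in weighted if w > 0.0]
    return rng.choices([p for p, _ in kept],
                       weights=[w for _, w in kept],
                       k=batch_size)


if __name__ == "__main__":
    pool = [
        Problem("trivial arithmetic", 1.00),
        Problem("moderately easy olympiad warm-up", 0.75),
        Problem("hard AIME-style problem", 0.15),
    ]
    batch = sample_batch(pool, batch_size=8)
    print([p.prompt for p in batch])
```

Because the moderately easy problems admit short correct solutions, their presence in every batch keeps the reward-maximizing length distribution anchored, which is the mechanism the abstract describes; no explicit length penalty appears anywhere in the sketch.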