Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Abstract

Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, which raises inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, so the model trains primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward and yields a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is *emergent brevity for free*: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalty. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) maintain the baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, nearly twice as short. The code is available on GitHub (https://github.com/MBZUAI-Paris/Frugal-AI), with datasets and models on Hugging Face (https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc).
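The data-curation idea in the abstract, keeping moderately easy problems and modestly up-weighting them instead of filtering them out, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Problem` class, the `sampling_weight` and `sample_batch` helpers, the pass-rate band `(0.6, 0.9)`, and the boost factor `1.5` are all hypothetical choices made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    pass_rate: float  # fraction of base-model rollouts verified correct (assumed precomputed)


def sampling_weight(pass_rate: float, easy_band=(0.6, 0.9), easy_boost=1.5) -> float:
    """Hypothetical weighting rule; band and boost are illustrative assumptions."""
    lo, hi = easy_band
    if pass_rate == 0.0 or pass_rate > hi:
        # Never solved, or trivially easy: skipped in this sketch.
        return 0.0
    if pass_rate >= lo:
        # Moderately easy: retained and modestly up-weighted.
        return easy_boost
    # Harder problems keep the default weight.
    return 1.0


def sample_batch(problems, batch_size, seed=0):
    """Draw a training batch (with replacement) in proportion to the weights."""
    weights = [sampling_weight(p.pass_rate) for p in problems]
    rng = random.Random(seed)
    return rng.choices(problems, weights=weights, k=batch_size)


pool = [
    Problem("hard", 0.2),
    Problem("moderately easy", 0.75),
    Problem("trivial", 1.0),
    Problem("unsolved", 0.0),
]
batch = sample_batch(pool, batch_size=8)
```

A conventional filter would assign weight zero to everything above some pass-rate threshold. The only change in this sketch is that the moderately easy band stays in the sampling pool with a slightly larger weight, so short, solvable problems keep appearing during training.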
