Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

Abstract

Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty, and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles, and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches the accuracy of strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
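The abstract does not state the functional form of the difficulty-aware length penalty, so the following is only a minimal sketch of one way such a penalty could work, not the authors' formulation. It assumes difficulty is estimated from the pass rate of the rollouts already sampled for a prompt (which would add no extra training cost), and that the penalty on correct responses scales with that pass rate. The function name `length_penalized_rewards`, the coefficient `alpha`, and the linear form are all hypothetical.

```python
from typing import List


def length_penalized_rewards(
    correct: List[bool],
    lengths: List[int],
    max_len: int = 16384,  # assumed to mirror the 16K-token output budget
    alpha: float = 0.5,    # hypothetical penalty strength
) -> List[float]:
    """Illustrative difficulty-aware length penalty for one prompt's rollouts.

    Assumption: the group pass rate serves as an inverse difficulty estimate.
    Easy prompts (high pass rate) are penalized strongly for long answers,
    hard prompts (low pass rate) are penalized only weakly.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    pass_rate = sum(correct) / len(correct)

    rewards = []
    for ok, n_tokens in zip(correct, lengths):
        if not ok:
            # Incorrect rollouts get zero reward and no length term.
            rewards.append(0.0)
            continue
        # Fraction of the output budget used, clipped to [0, 1].
        used = min(n_tokens, max_len) / max_len
        rewards.append(1.0 - alpha * pass_rate * used)
    return rewards


if __name__ == "__main__":
    # Easy prompt: every rollout is correct, so longer answers lose more reward.
    print(length_penalized_rewards([True, True], [8192, 16384]))
    # Hard prompt: one of four rollouts is correct, so its length costs little.
    print(length_penalized_rewards([True, False, False, False], [16384, 100, 100, 100]))
```

Under these assumptions, a full-budget correct answer to an easy prompt is penalized four times as much as the same-length correct answer to a prompt that only one in four rollouts solves, which matches the qualitative behavior the abstract describes.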
