Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
April 2, 2026
Authors: Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin
cs.AI
Abstract
Frontier open-weight models have widely adopted reinforcement learning with verifiable rewards (RLVR) across diverse domains to build general-purpose reasoning models, but their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty, and sample efficiency. Moreover, long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles, and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained under a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, pushing the Pareto frontier of accuracy versus token budget.
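The difficulty-aware length penalty described above can be illustrated with a minimal sketch. This is a hypothetical reading of the idea, not the authors' implementation: the function name, the use of per-problem rollout pass rate as the difficulty signal, and the specific linear schedule are all illustrative assumptions; only the 16K-token budget comes from the abstract.

```python
# Hypothetical sketch of a difficulty-aware length penalty. Assumption: the
# "standard length penalty" linearly penalizes tokens beyond a soft cap, and
# difficulty is estimated from the fraction of correct rollouts (pass rate),
# which GRPO-style training already computes, hence no extra overhead.

def difficulty_aware_length_penalty(
    trace_len: int,
    pass_rate: float,      # fraction of correct rollouts: 0.0 = hard, 1.0 = easy
    budget: int = 16384,   # 16K-token output budget from the abstract
    max_penalty: float = 0.5,  # illustrative penalty ceiling (assumption)
) -> float:
    """Return a reward penalty in [0, max_penalty].

    Easy problems (high pass_rate) get a tighter effective budget, so long
    traces on them are penalized; hard problems keep the full budget, so
    long reasoning on them goes unpenalized.
    """
    # Shrink the soft cap for easy problems: full budget at pass_rate=0,
    # half the budget at pass_rate=1 (illustrative schedule).
    soft_cap = budget * (1.0 - 0.5 * pass_rate)
    overflow = max(0.0, trace_len - soft_cap)
    # Linear penalty on tokens beyond the cap, clipped at max_penalty.
    return min(max_penalty, max_penalty * overflow / (budget - soft_cap + 1e-9))
```

Under this sketch, a 16K-token trace on an always-solved problem incurs a near-maximal penalty, while the same trace on a never-solved problem incurs none, which is the incentive the abstract describes: shorter traces for easy problems, unconstrained reasoning for difficult ones.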