**Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning**

Abstract

Building general-purpose reasoning models with reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges because domains vary widely in rollout length, problem difficulty, and sample efficiency. Moreover, models that produce long chain-of-thought traces increase inference cost and latency, which makes efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles, and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
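
The abstract names two mechanisms without giving their formulas. The sketch below is only one plausible reading of each, not the paper's method: `pick_domain` selects the domain whose realized share of consumed samples lags its target ratio the most, and `length_penalized_reward` scales a standard length penalty by the prompt's pass rate, so easy prompts are pushed toward shorter traces while hard prompts are left free to reason longer. All function names, the pass-rate difficulty proxy, the choice to penalize only correct responses, and the `alpha` weight are assumptions made for illustration.

```python
def pick_domain(target_ratios, consumed):
    """Pick the domain whose realized share lags its target share the most.

    target_ratios: dict domain -> desired fraction of training samples.
    consumed: dict domain -> number of samples already used by the trainer.
    (Hypothetical helper; the paper's mechanism may differ.)
    """
    total = sum(consumed.values())

    def deficit(domain):
        realized = consumed.get(domain, 0) / total if total else 0.0
        return target_ratios[domain] - realized

    return max(target_ratios, key=deficit)


def length_penalized_reward(correct, length, max_length, pass_rate, alpha=0.5):
    """Correctness reward minus a length penalty scaled by prompt easiness.

    pass_rate: fraction of rollouts for the same prompt that were correct.
    It is used here as an assumed difficulty proxy that needs no extra
    computation, since those rollouts are already generated during RL.
    The penalty is applied only to correct responses (an assumption).
    """
    if not correct:
        return 0.0
    penalty = alpha * pass_rate * (length / max_length)
    return 1.0 - penalty


# Code has received fewer samples than its 50% target, so it is picked next.
next_domain = pick_domain({"math": 0.5, "code": 0.5}, {"math": 3, "code": 1})

# Same 8K-token correct answer: penalized on an easy prompt, not on a hard one.
easy = length_penalized_reward(True, 8000, 16000, pass_rate=1.0)
hard = length_penalized_reward(True, 8000, 16000, pass_rate=0.0)
```

Under these assumptions, a domain with long rollouts that finishes fewer samples per step accumulates a deficit and is sampled more often until its share recovers, while the reward difference between `easy` and `hard` shows how the same trace length costs more on a prompt the model already solves reliably.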
