Apriel-1.5-15b-Thinker

Abstract

We present Apriel-1.5-15B-Thinker, a 15-billion-parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.
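To make stage (1) concrete, the sketch below illustrates one common form of depth upscaling: a contiguous block of existing layers is copied and inserted back into the stack, so the deeper model starts from trained weights rather than from scratch. This is only an illustration under stated assumptions. The abstract does not say which layers are duplicated or how many are added, so the function name `depth_upscale`, the `start`/`end` indices, and the toy eight-layer stack are all hypothetical and are not taken from the released training recipe.

```python
from copy import deepcopy
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def depth_upscale(layers: Sequence[T], start: int, end: int) -> List[T]:
    """Return a deeper stack in which layers[start:end] appears twice in a row.

    Hypothetical helper: the choice of block to duplicate is an assumption,
    not something stated in the abstract.
    """
    if not 0 <= start < end <= len(layers):
        raise ValueError("need 0 <= start < end <= len(layers)")
    # Deep-copy the block so the duplicates can be trained independently
    # of the layers they were initialized from.
    block_copy = [deepcopy(layer) for layer in layers[start:end]]
    return list(layers[:end]) + block_copy + list(layers[end:])


# Toy stand-in for a transformer: each "layer" is just a dict of parameters.
base = [{"name": f"layer_{i}"} for i in range(8)]
deeper = depth_upscale(base, start=2, end=6)
```

In a real model the list elements would be transformer blocks rather than dicts, and the upscaled stack would then go through the continual pre-training described in stage (2) so that the duplicated layers can diverge from their originals.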