PRISM: Demystifying Retention and Interaction in Mid-Training

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline raises the macro-average across six reasoning benchmarks from under 12 to between 29 and 42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at the mid-training stage, not the RL stage: including science data during mid-training unlocks GPQA-Diamond gains of +17 to +28 points during RL, while changing the RL mix produces differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves the representational geometry established by mid-training (CKA above 0.998) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet succeeds only on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement, and they provide practical guidance for designing robust mid-training pipelines.
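The two mechanistic measurements mentioned in the abstract, the fraction of parameters that change between checkpoints and the CKA similarity between representations, can be illustrated with a small sketch. This is not the authors' code. It assumes the linear variant of CKA (the abstract does not say which variant is used) and a hypothetical tolerance `tol` for deciding that a weight has "changed"; the function names and the toy data are likewise illustrative only.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean from each column of a row-major matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices over the same n examples.

    Assumption: linear kernel. Undefined (division by zero) if either
    matrix has constant activations across all examples.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(yc, xc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    return numerator / denominator


def changed_fraction(before, after, tol=1e-6):
    """Fraction of flattened parameters whose absolute change exceeds tol.

    Assumption: tol is a hypothetical threshold, not a value from the paper.
    """
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy activations: 4 examples, 2 features (illustrative values only).
acts = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same activations rotated by 90 degrees; linear CKA is invariant to this.
rotated = [[-r[1], r[0]] for r in acts]

cka_same = linear_cka(acts, acts)
cka_rotated = linear_cka(acts, rotated)
frac = changed_fraction([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.5, 3.0])
```

Under these assumptions, a CKA value near 1 between pre-RL and post-RL activations would indicate that the representational geometry is preserved, and a small `changed_fraction` would correspond to a sparse update. For real checkpoints the same computation would normally be done with an array library rather than pure Python lists.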
