OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Abstract

Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training. We train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines while reducing search cost by a factor of 15 to 35. Our key findings are that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
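The pipeline described in the abstract (one CPT model per dataset, one distribution vector per model, a post-hoc search over composition weights) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: parameters are short Python lists rather than real model weights, the merge rule is assumed to be the base parameters plus a weighted sum of distribution vectors, `toy_score` is a hypothetical stand-in for a benchmark evaluation, and plain random search is swapped in for the Bayesian optimization used in the paper.

```python
import math
import random


def distribution_vector(theta_cpt, theta_base):
    """Parameter shift induced by CPT on one dataset (assumed: CPT minus base)."""
    return [c - b for c, b in zip(theta_cpt, theta_base)]


def merge(theta_base, vectors, weights):
    """Assumed merge rule: base parameters plus a weighted sum of the vectors."""
    merged = list(theta_base)
    for w, v in zip(weights, vectors):
        for i, delta in enumerate(v):
            merged[i] += w * delta
    return merged


def search_weights(theta_base, vectors, score_fn, n_trials=200, seed=0):
    """Random search over weights in [0, 1]; a stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.uniform(0.0, 1.0) for _ in vectors]
        s = score_fn(merge(theta_base, vectors, w))
        if s > best_score:
            best_w, best_score = w, s
    return best_w, best_score


# Toy setup: a base "model" with 8 parameters and three CPT variants of it.
rng = random.Random(42)
theta_base = [rng.gauss(0.0, 1.0) for _ in range(8)]
cpt_models = [[b + rng.gauss(0.0, 0.1) for b in theta_base] for _ in range(3)]
vectors = [distribution_vector(m, theta_base) for m in cpt_models]

# Hypothetical objective: negative distance to a hidden target mix of the vectors.
target = merge(theta_base, vectors, [0.7, 0.2, 0.5])


def toy_score(theta):
    return -math.sqrt(sum((t - g) ** 2 for t, g in zip(theta, target)))


best_w, best_score = search_weights(theta_base, vectors, toy_score)
```

Because the vectors are computed once and only the weights change between trials, each trial costs one merge plus one evaluation rather than a training run. Re-optimizing for a different objective amounts to calling `search_weights` again with another `score_fn` on the same `vectors`.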
