OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
March 30, 2026
Authors: Haiyue Song, Masao Utiyama
cs.AI
Abstract
Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of the training data remains a sensitive hyperparameter that is expensive to tune: the ratios must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector (the parameter shift induced by that dataset), and search for optimal composition weights post hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data-mixture and model-averaging baselines at 15-35 times lower search cost. Key findings: 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data-mixture CPT; and 2) the same vector pool can be re-optimized for a new objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
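The core idea can be sketched in a few lines: a distribution vector is the parameter delta between a per-dataset CPT model and the base model, and the merged model is the base plus a weighted sum of those vectors. The toy code below is a minimal illustration with flattened parameter vectors; all names are hypothetical, the objective is a synthetic stand-in for a validation score, and random search stands in for the Bayesian optimization the paper actually uses.

```python
import numpy as np

def distribution_vector(cpt_params, base_params):
    """Parameter shift induced by continually pre-training on one dataset."""
    return cpt_params - base_params

def merge(base_params, vectors, weights):
    """Compose the base model with a weighted sum of distribution vectors."""
    return base_params + sum(w * v for w, v in zip(weights, vectors))

# Toy setup: a 4-parameter "model" and two per-dataset CPT models
# (e.g. one adapted to Japanese, one to Math). Purely illustrative.
base = np.zeros(4)
cpt_a = np.array([1.0, 0.0, 0.5, 0.0])
cpt_b = np.array([0.0, 1.0, 0.0, 0.5])
vecs = [distribution_vector(cpt_a, base), distribution_vector(cpt_b, base)]

# Synthetic objective: distance to a fictitious "ideal" parameter setting.
# In the real method the objective is the merged model's validation score.
target = np.array([0.6, 0.4, 0.3, 0.2])
def score(weights):
    return -np.linalg.norm(merge(base, vecs, weights) - target)

# Cheap random search as a stand-in for Bayesian optimization:
# each trial only re-merges vectors, no retraining is needed.
rng = np.random.default_rng(0)
best_w, best_s = None, -np.inf
for _ in range(200):
    w = rng.random(2)
    s = score(w)
    if s > best_s:
        best_w, best_s = w, s

print("best weights:", np.round(best_w, 2))
```

Because each trial is just a vector merge plus an evaluation, the search never touches the training loop; this is also why the same vector pool can be re-optimized for a different objective simply by swapping out `score`.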