OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
March 30, 2026
Authors: Haiyue Song, Masao Utiyama
cs.AI
Abstract
Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratios of the training data remain sensitive hyperparameters that are expensive to tune: they must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector (the parameter shift induced by that dataset), and search for optimal composition weights post hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data-mixing and model-averaging baselines at 15-35x lower search cost. Key findings: 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data-mixing CPT; and 2) the same vector pool can be re-optimized for a new objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training-time decision, can be reformulated as post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
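The pipeline described in the abstract (per-dataset CPT, distribution-vector extraction, post-hoc weight search) can be illustrated with a minimal toy sketch. All names here are hypothetical, models are stand-in flat parameter vectors rather than LLM checkpoints, and a cheap random search stands in for the Bayesian optimization the paper actually uses:

```python
import random

def distribution_vector(base, cpt_model):
    """Parameter shift induced by CPT on one dataset: theta_d - theta_base."""
    return [c - b for b, c in zip(base, cpt_model)]

def merge(base, vectors, weights):
    """Compose the base model with a weighted sum of distribution vectors:
    theta = theta_base + sum_i w_i * v_i."""
    merged = list(base)
    for w, v in zip(weights, vectors):
        for i, vi in enumerate(v):
            merged[i] += w * vi
    return merged

def search_weights(base, vectors, score, n_trials=200, seed=0):
    """Post-hoc weight search over merged models. Random search is used
    here only as a stand-in for Bayesian optimization; no retraining
    happens inside the loop, only merging and evaluation."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.random() for _ in vectors]
        s = score(merge(base, vectors, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

Because the search loop only merges vectors and evaluates the result, the same vector pool can be re-optimized for a different `score` function (a new target benchmark) without any further training, which is the flexibility the abstract highlights.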