OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Abstract

Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratios of the training data remain a sensitive hyperparameter that is expensive to tune: the ratios must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training. We train one CPT model per dataset, extract from each model a distribution vector representing the parameter shift induced by that dataset, and search for optimal composition weights post hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines at 15-35 times lower search cost. Two key findings emerge: 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT; and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a decision made before training, can be reformulated as a post hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
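The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes that a distribution vector is the element-wise difference between a CPT model's parameters and the base model's parameters, and that merging adds a weighted sum of these vectors back onto the base. The abstract does not spell out either formula, so both are assumptions here. Plain random search stands in for Bayesian optimization to keep the example dependency-free. The toy parameter dictionaries, the [0, 1] weight range, and the names `distribution_vector`, `merge`, `search_weights`, and `score` are all hypothetical.

```python
import random

def distribution_vector(base, cpt):
    """Parameter shift induced by one dataset (assumed: theta_cpt - theta_base)."""
    return {name: cpt[name] - base[name] for name in base}

def merge(base, vectors, weights):
    """Assumed composition rule: theta_base + sum_i w_i * tau_i."""
    merged = dict(base)
    for tau, w in zip(vectors, weights):
        for name in merged:
            merged[name] += w * tau[name]
    return merged

def search_weights(base, vectors, score, trials=200, seed=0):
    """Random search over weights in [0, 1]; a stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(0.0, 1.0) for _ in vectors]
        s = score(merge(base, vectors, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy "models" with two scalar parameters each (hypothetical values).
base = {"a": 0.0, "b": 0.0}
cpt_ja = {"a": 1.0, "b": 0.0}    # stands in for a model trained on Japanese only
cpt_math = {"a": 0.0, "b": 1.0}  # stands in for a model trained on math only
vectors = [distribution_vector(base, cpt_ja), distribution_vector(base, cpt_math)]

# Hypothetical objective: negative squared distance to a target parameter set.
target = {"a": 0.3, "b": 0.7}
def score(params):
    return -sum((params[k] - target[k]) ** 2 for k in target)

best_w, best_s = search_weights(base, vectors, score)
```

Because the per-dataset models are trained once, changing `score` and rerunning `search_weights` reuses the same vector pool without any retraining, which mirrors the re-optimization property the abstract reports. In practice the objective would be a benchmark evaluation of the merged model rather than a distance to known parameters.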
