

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

July 21, 2025
Authors: Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang
cs.AI

Abstract

Continual pre-training on small-scale, task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields over a domain space to achieve balanced performance. Previous domain re-weighting strategies rely on manually specified heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source- and target-field benchmarks. Furthermore, it generalizes well to unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis shows that the agent's heuristics align well with human intuition and that it achieves superior model performance with less source-field data.
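
To make the domain re-weighting idea concrete, the sketch below shows one way a per-step agent could adjust domain sampling weights during continual pre-training. This is a minimal illustration only, not the paper's implementation: the domain list, the ToyMixingAgent policy, and the random evaluation feedback are hypothetical stand-ins, whereas the actual Data Mixing Agent is trained with reinforcement learning on data-mixing trajectories scored by an evaluation environment.

```python
import numpy as np

# Hypothetical domain space: a few source domains plus the math target field.
DOMAINS = ["web", "books", "code", "math_target"]

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

class ToyMixingAgent:
    """Toy stand-in for the learned agent: maps evaluation feedback to
    updated domain weights. Here we simply nudge logits toward domains
    whose held-out scores lag behind the mean; the real agent learns this
    mapping from data-mixing trajectories via reinforcement learning."""
    def __init__(self, n_domains, lr=0.5):
        self.logits = np.zeros(n_domains)
        self.lr = lr

    def step(self, eval_scores):
        # Upweight domains whose evaluation score is below the mean.
        gap = np.mean(eval_scores) - np.array(eval_scores)
        self.logits += self.lr * gap
        return softmax(self.logits)

def sample_mixture(weights, batch_size, rng):
    """Draw a training batch's domain composition from the current weights."""
    counts = rng.multinomial(batch_size, weights)
    return dict(zip(DOMAINS, counts))

rng = np.random.default_rng(0)
agent = ToyMixingAgent(len(DOMAINS))
for t in range(3):
    # Placeholder feedback: in the paper this would come from an evaluation
    # environment scoring source- and target-field benchmarks.
    eval_scores = rng.uniform(0.4, 0.9, size=len(DOMAINS))
    weights = agent.step(eval_scores)
    print(t, sample_mixture(weights, batch_size=256, rng=rng))
```

In practice the agent's update rule is a learned model rather than the mean-gap nudge used here; the sketch only illustrates the loop of observing evaluation feedback, emitting new domain weights, and composing the next training batch accordingly.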