AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining
June 14, 2026
Authors: Jing Ma, Chenhao Dang, Mingjie Liao
cs.AI
Abstract
Optimizing the composition of pretraining data is pivotal for LLM generalization. Dynamic mixing outperforms static strategies because it tracks how training dynamics evolve, but current methods fail to reconcile computational efficiency with sample efficiency and with the structural flexibility that diverse pipelines require. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective, using a parameterized policy that we theoretically prove acts as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, in which a policy learned on a small model is transferred to a larger target model; and (ii) a non-proxy mode for direct end-to-end training from scratch without prior knowledge. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across a range of architectures. On Pythia-1B, it reaches the optimal validation perplexity in up to 66% fewer training steps than competitive baselines, with a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, while adding only 0.4% to the average per-step wall-clock time and 2% to memory usage. Code is available at https://github.com/DANG-ai/AC-ODM.
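To make the reinforcement learning framing concrete, the following is a minimal toy sketch of an actor-critic loop that learns sampling weights over data domains. It is not the AC-ODM implementation: the abstract does not specify the reward, the state, or the network architectures, so everything here is an assumption made for illustration. The actor is a plain softmax over per-domain logits, the critic is a single scalar baseline, and `domain_gain` is a hypothetical stand-in for the benefit (for example, loss reduction) of training on a batch from each domain.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_mixing_policy(domain_gain, steps=3000, actor_lr=0.1,
                        critic_lr=0.1, seed=0):
    """Learn domain sampling probabilities with a toy actor-critic update.

    domain_gain: hypothetical mean reward per domain (an assumption, used
    here in place of a real training signal from a language model).
    """
    rng = random.Random(seed)
    k = len(domain_gain)
    logits = [0.0] * k   # actor parameters: one logit per domain
    value = 0.0          # critic: scalar estimate of the expected reward

    for _ in range(steps):
        probs = softmax(logits)
        # Actor samples which domain the next batch is drawn from.
        a = rng.choices(range(k), weights=probs)[0]
        # Noisy reward observed after "training" on that domain.
        reward = domain_gain[a] + rng.gauss(0.0, 0.1)
        # Advantage: how much better the reward was than the critic expected.
        advantage = reward - value
        # Policy-gradient step for a softmax policy:
        # d log pi(a) / d logit_i = 1[i == a] - probs[i].
        for i in range(k):
            indicator = 1.0 if i == a else 0.0
            logits[i] += actor_lr * advantage * (indicator - probs[i])
        # Critic moves toward the observed reward.
        value += critic_lr * advantage

    return softmax(logits)


if __name__ == "__main__":
    mixing_weights = train_mixing_policy([0.1, 0.5, 0.9])
    print([round(p, 3) for p in mixing_weights])
```

In this toy setting the learned weights concentrate on the domain with the highest hypothetical gain. A real online mixing method would compute the reward from the model being trained, and that reward would change as training progresses, which is what makes the mixing dynamic.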