AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

Abstract

Optimizing pretraining data composition is pivotal for LLM generalization. Dynamic mixing outperforms static strategies by capturing evolving training dynamics, but current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility across diverse pipelines. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective using a parameterized policy. We prove theoretically that this policy acts as a dynamic linear surrogate that maximizes the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, in which a policy learned on a small model is transferred to a larger target model; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity in up to 66% fewer training steps than competitive baselines, with a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval. These gains come with a negligible 0.4% increase in per-step wall-clock time and only 2% additional memory overhead. Code is available at https://github.com/DANG-ai/AC-ODM.