

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

March 27, 2026
Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang
cs.AI

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
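The abstract names three dynamic-data paradigms: sample selection, domain mixture adjustment, and sample reweighting. As a rough illustration of the first, the toy loop below periodically re-scores a data pool and trains only on the currently top-scoring subset. This is a generic sketch of the paradigm, not DataFlex's actual API: every name here is hypothetical, and the sinusoidal "score" is a stand-in for the loss-, gradient-, or embedding-based signals a real selector would compute from the model.

```python
import math


def select_top_k(scores, k):
    """Return indices of the k highest-scoring samples."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def dynamic_selection_loop(dataset, total_steps, reselect_every, keep_ratio):
    """Toy dynamic sample selection: every `reselect_every` steps, re-score
    the full pool and keep only the top `keep_ratio` fraction for training.
    Returns the active subset (as sorted index tuples) at each step."""
    active = list(range(len(dataset)))
    history = []
    for step in range(total_steps):
        if step % reselect_every == 0:
            # Hypothetical scoring rule; a real method would use per-sample
            # loss, gradient norms, or embedding similarity to a target set.
            scores = [math.sin(x + step) for x in dataset]
            k = max(1, int(len(dataset) * keep_ratio))
            active = select_top_k(scores, k)
        # ... one training step on dataset[i] for i in active ...
        history.append(tuple(sorted(active)))
    return history


hist = dynamic_selection_loop(
    [0.1, 0.5, 0.9, 1.3, 1.7, 2.1],
    total_steps=6, reselect_every=3, keep_ratio=0.5,
)
```

Because the scores are recomputed with the step in the loop above, the active subset shifts over training; the same skeleton covers sample reweighting if, instead of a hard top-k cut, the scores are turned into per-sample loss weights.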