DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
March 27, 2026
Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang
cs.AI
Abstract
Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization (sample selection, domain mixture adjustment, and sample reweighting) while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU for both Mistral-7B and Llama-3.2-3B. For data mixture optimization, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over the original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
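To make the "extensible trainer abstraction with pluggable data strategies" idea concrete, here is a minimal sketch in plain Python. It is not the actual DataFlex API: all class and field names (`DataStrategy`, `SelectStrategy`, `ReweightStrategy`, `DynamicTrainer`, the per-sample `score`) are hypothetical, and the sketch only illustrates how sample selection and sample reweighting could share one hook in an otherwise standard training step.

```python
# Illustrative sketch (hypothetical names, not the DataFlex API): one trainer
# loop where dynamic data optimization plugs in via a strategy hook.
from abc import ABC, abstractmethod
from typing import Dict, List


class DataStrategy(ABC):
    """Hook invoked each step to adjust the batch before the loss is computed."""

    @abstractmethod
    def update(self, step: int, samples: List[Dict]) -> List[Dict]:
        ...


class SelectStrategy(DataStrategy):
    """Sample selection: keep only samples whose score clears a threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def update(self, step: int, samples: List[Dict]) -> List[Dict]:
        return [s for s in samples if s["score"] >= self.threshold]


class ReweightStrategy(DataStrategy):
    """Sample reweighting: attach normalized per-sample loss weights."""

    def update(self, step: int, samples: List[Dict]) -> List[Dict]:
        total = sum(s["score"] for s in samples) or 1.0
        for s in samples:
            s["weight"] = s["score"] / total
        return samples


class DynamicTrainer:
    """Drop-in wrapper: a standard step plus one pluggable data strategy."""

    def __init__(self, strategy: DataStrategy):
        self.strategy = strategy

    def training_step(self, step: int, batch: List[Dict]) -> List[Dict]:
        # In a real trainer the returned batch would feed the forward/backward
        # pass; here we just return it to show the hook's effect.
        return self.strategy.update(step, batch)


batch = [{"score": 0.9}, {"score": 0.2}, {"score": 0.7}]
selected = DynamicTrainer(SelectStrategy(0.5)).training_step(0, batch)
```

Because both strategies share the `DataStrategy` interface, swapping selection for reweighting (or a domain-mixture adjuster) is a one-line change to the trainer's constructor, which is the kind of uniformity the abstract attributes to DataFlex's modular design.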