DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, which hinders reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization (sample selection, domain mixture adjustment, and sample reweighting) while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over the original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
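To make the three paradigms named in the abstract concrete, the following is a minimal, framework-free sketch. It is not the DataFlex API, and it does not reproduce DoReMi, ODM, or any method evaluated in the paper. The function names (`select_top_k`, `update_mixture`, `reweight`), the loss-based scoring, and the multiplicative-weights update are all illustrative assumptions.

```python
import math
from typing import Dict, List


def select_top_k(losses: List[float], k: int) -> List[int]:
    """Sample selection (hypothetical rule): keep the k highest-loss samples."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])


def update_mixture(
    weights: Dict[str, float],
    domain_losses: Dict[str, float],
    lr: float = 1.0,
) -> Dict[str, float]:
    """Domain mixture adjustment (hypothetical rule): upweight high-loss
    domains with a multiplicative-weights step, then renormalize to sum to 1."""
    scaled = {d: w * math.exp(lr * domain_losses[d]) for d, w in weights.items()}
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}


def reweight(losses: List[float], temperature: float = 1.0) -> List[float]:
    """Sample reweighting (hypothetical rule): softmax over per-sample losses,
    so harder samples contribute more to the batch loss."""
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    batch_losses = [0.2, 1.5, 0.7, 1.1]
    print(select_top_k(batch_losses, k=2))
    print(update_mixture({"web": 0.5, "code": 0.5}, {"web": 1.0, "code": 0.0}))
    print(reweight(batch_losses))
```

In a dynamic training loop, rules like these would be re-evaluated periodically as the model changes, which is what distinguishes the dynamic setting from a one-off static filtering pass.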
