DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
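To make the three paradigms named above concrete, the following is a minimal, dependency-free sketch of what sample selection, domain mixture adjustment, and sample reweighting each compute at a single training step. It is an illustration under stated assumptions, not DataFlex's API: the function names (`select_top_k`, `update_mixture`, `reweighted_loss`) and parameters (`lr`, `temperature`) are hypothetical, and the mixture update is a generic multiplicative-weights rule rather than the DoReMi or ODM algorithms themselves.

```python
import math


def select_top_k(scores, k):
    """Sample selection: return indices of the k highest-scoring samples.

    `scores` stands in for any per-sample utility estimate; how it is
    computed is method-specific and not modeled here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update_mixture(weights, domain_losses, lr=1.0):
    """Domain mixture adjustment: generic multiplicative-weights update.

    Sampling probability shifts toward domains with higher loss, then the
    weights are renormalized so they sum to 1. (Hypothetical rule chosen
    for illustration only.)
    """
    raw = [w * math.exp(lr * loss) for w, loss in zip(weights, domain_losses)]
    total = sum(raw)
    return [r / total for r in raw]


def reweighted_loss(losses, temperature=1.0):
    """Sample reweighting: softmax weights over per-sample losses.

    Returns the weighted batch loss and the weights. Subtracting the max
    before exponentiating keeps the softmax numerically stable.
    """
    peak = max(losses)
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return sum(w * loss for w, loss in zip(weights, losses)), weights


if __name__ == "__main__":
    # Keep the two most useful of three samples.
    print(select_top_k([0.1, 0.9, 0.5], k=2))
    # The domain with the higher loss receives a larger sampling share.
    print(update_mixture([0.5, 0.5], [2.0, 1.0]))
    # Harder samples contribute more to the batch loss.
    print(reweighted_loss([0.2, 1.5, 0.7]))
```

In a dynamic training loop, each of these would be recomputed periodically from the current model state, which is what distinguishes the setting from one-off static data filtering.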
