RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Abstract

The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX through from-scratch pre-training at multiple model scales and find that models trained on RefineX-refined data consistently outperform those trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and ProX-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
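
To make the idea of distilling an end-to-end refinement into a deletion-only program more concrete, the following is a minimal sketch and not the RefineX implementation. It assumes line-level granularity, uses Python's standard `difflib` to compare a raw document with an expert-refined version, and keeps only pure deletions; the function names `derive_deletion_program` and `apply_program` are hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher


def derive_deletion_program(raw_lines, refined_lines):
    """Return half-open (start, end) ranges of raw_lines that the refined text drops.

    Assumption for this sketch: only pure deletions are kept. Spans the
    refined text rewrote ("replace") or added ("insert") are ignored, so the
    resulting program can never introduce text that is absent from the raw input.
    """
    matcher = SequenceMatcher(a=raw_lines, b=refined_lines, autojunk=False)
    program = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            program.append((i1, i2))
    return program


def apply_program(raw_lines, program):
    """Apply a deletion program and return the surviving raw lines in order."""
    dropped = set()
    for start, end in program:
        dropped.update(range(start, end))
    return [line for idx, line in enumerate(raw_lines) if idx not in dropped]


# Hypothetical example document: two boilerplate lines around real content.
raw = [
    "Home | Login | Share",
    "LLMs depend on data quality.",
    "Click here to subscribe!",
    "Refinement removes noise.",
]
refined = [
    "LLMs depend on data quality.",
    "Refinement removes noise.",
]

program = derive_deletion_program(raw, refined)
print(program)                      # [(0, 1), (2, 3)]
print(apply_program(raw, program))  # the two content lines, unchanged
```

Because the program can only delete, every surviving line is copied verbatim from the raw text, which is one way to read the abstract's claim about preserving the naturalness of the original data. A refine model trained on such (raw text, program) pairs would then predict the program directly instead of regenerating the document.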
