RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs
July 4, 2025
Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
cs.AI
Abstract
The foundational capabilities of large language models (LLMs) are deeply
influenced by the quality of their pre-training corpora. However, enhancing
data quality at scale remains a significant challenge, primarily due to the
trade-off between refinement effectiveness and processing efficiency. While
rule-based filtering remains the dominant paradigm, it typically operates at
the document level and lacks the granularity needed to refine specific content
within documents. Inspired by emerging work such as ProX, we propose
RefineX, a novel framework for large-scale, surgical refinement of
pre-training data through programmatic editing tasks. RefineX enables efficient
and fine-grained data refinement while reliably preserving the diversity and
naturalness of raw text. The core strength of RefineX lies in distilling
high-quality, expert-guided end-to-end refinement results into minimal
edit-based deletion programs. This high-precision distillation pipeline is used
to train an efficient and reliable refine model that can systematically improve
every instance in the corpus at scale. We evaluate RefineX with from-scratch
pre-training at multiple model scales and find that it consistently outperforms
models trained on raw, filtered, or alternatively refined data across diverse
downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on
lighteval tasks, and achieves comparable performance using significantly fewer
training tokens. Further analysis shows that RefineX reliably enhances text
quality with both high efficiency and precision, outperforming prior approaches
such as end-to-end generation and ProX-C. These results position RefineX as a
scalable, effective, and reliable solution for optimizing pre-training data in
modern LLM pipelines.
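The core idea above — distilling an expert's end-to-end refinement of a raw document into a minimal, deletion-only edit program — can be illustrated with a small sketch. This is not the paper's implementation: the function names, the line-level edit granularity, and the use of a generic diff are assumptions made here for illustration only.

```python
# Illustrative sketch (assumed, not RefineX's actual pipeline): derive a
# deletion-only edit program from (raw, expert-refined) text pairs, then
# apply it. Insertions/replacements from the diff are deliberately skipped,
# since a deletion-only program may only remove content already present.
import difflib

def distill_deletion_program(raw: str, refined: str):
    """Return (start, end) line spans of `raw` to delete so the result
    approximates `refined`."""
    matcher = difflib.SequenceMatcher(
        a=raw.splitlines(), b=refined.splitlines(), autojunk=False
    )
    return [(i1, i2) for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag == "delete"]

def apply_program(raw: str, program):
    """Execute the deletion program against the raw text."""
    drop = {i for start, end in program for i in range(start, end)}
    return "\n".join(
        line for i, line in enumerate(raw.splitlines()) if i not in drop
    )

raw = "Useful sentence.\nCLICK HERE to subscribe!\nAnother useful sentence."
refined = "Useful sentence.\nAnother useful sentence."
prog = distill_deletion_program(raw, refined)
print(prog)                      # [(1, 2)]
print(apply_program(raw, prog))  # the boilerplate line is removed
```

Restricting the program space to deletions, as the abstract describes, keeps every surviving token identical to the raw text, which is what preserves the diversity and naturalness of the corpus while still removing low-quality spans.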