
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

September 25, 2024
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
cs.AI

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, by 14.6% for Llama-2-7B, and by 20.3% for CodeLlama-7B, all within 10B tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX along with a >100B-token corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
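
The abstract frames data refinement as program generation: a small model emits fine-grained operations (such as string normalization or boilerplate-line removal) for each document, and those operations are then executed to produce the refined text. The sketch below is only a minimal illustration of that idea; the operation and function names (`normalize_string`, `remove_lines`, `execute_program`) and the hard-coded example program are assumptions for illustration and do not correspond to the actual ProX operation set or API (see the repository linked above for the real implementation).

```python
# Minimal, hypothetical sketch of "data refinement as programming":
# a refining model would emit a small program of fine-grained operations
# for one document, which is then executed to produce the refined text.
# Names and operations here are illustrative assumptions, not the ProX API.

import re
from typing import Callable, Dict, List


def normalize_string(text: str, old: str, new: str) -> str:
    """Replace a noisy substring (e.g., stray punctuation) everywhere."""
    return text.replace(old, new)


def remove_lines(text: str, pattern: str) -> str:
    """Drop lines matching a regex, e.g., navigation or boilerplate."""
    return "\n".join(
        line for line in text.splitlines() if not re.search(pattern, line)
    )


# Registry mapping operation names to their implementations.
OPERATIONS: Dict[str, Callable[..., str]] = {
    "normalize_string": normalize_string,
    "remove_lines": remove_lines,
}


def execute_program(text: str, program: List[dict]) -> str:
    """Apply a model-generated list of operations to one document."""
    for op in program:
        fn = OPERATIONS[op["name"]]
        text = fn(text, **op["args"])
    return text


if __name__ == "__main__":
    doc = (
        "Home | Login | Share\n"
        "The theorem holds for all n >= 1 .\n"
        "Click here to subscribe!"
    )
    # In ProX, a small language model would generate this per-example program;
    # here it is hard-coded purely for illustration.
    program = [
        {"name": "remove_lines", "args": {"pattern": r"(Login|subscribe)"}},
        {"name": "normalize_string", "args": {"old": " .", "new": "."}},
    ]
    print(execute_program(doc, program))
```

In this toy example the executed program strips the navigation and call-to-action lines and normalizes the spacing before the period, leaving only the substantive sentence; at scale, generating such a tailored program per document is what replaces hand-crafted, corpus-wide heuristic rules.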
