Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform models trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, by 14.6% for Llama-2-7B, and by 20.3% for CodeLlama-7B, all within 10B training tokens, making them comparable to models such as Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with a >100B corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
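To make the idea of treating data refinement as a programming task more concrete, the following is a minimal sketch of how a per-example program might be executed against a document. It is not the ProX implementation: the operation names (`drop_doc`, `keep_doc`, `remove_lines`, `normalize`), their signatures, and the `execute` helper are hypothetical names chosen for illustration, and the program text is hard-coded here, whereas in the framework it would be generated by a small language model for each example.

```python
import ast

# Hypothetical operation set; each takes the document as a list of lines
# and returns the refined list of lines, or None to discard the document.
def drop_doc(lines):
    return None

def keep_doc(lines):
    return lines

def remove_lines(lines, start, end):
    # Remove lines start..end inclusive (0-indexed).
    return lines[:start] + lines[end + 1:]

def normalize(lines, source_str, target_str):
    # Replace every occurrence of source_str with target_str.
    return [line.replace(source_str, target_str) for line in lines]

OPS = {
    "drop_doc": drop_doc,
    "keep_doc": keep_doc,
    "remove_lines": remove_lines,
    "normalize": normalize,
}

def execute(program, document):
    """Run a generated program (one call per line) on a single document.

    Only whitelisted calls with literal arguments are accepted, so the
    model-generated text is never passed to eval(). Operations run in
    order, so line indices refer to the document as already edited.
    """
    lines = document.split("\n")
    for stmt in ast.parse(program).body:
        call = getattr(stmt, "value", None)
        if not (isinstance(stmt, ast.Expr)
                and isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id in OPS
                and not call.keywords):
            raise ValueError("unsupported statement in program")
        args = [ast.literal_eval(arg) for arg in call.args]
        lines = OPS[call.func.id](lines, *args)
        if lines is None:  # document was dropped
            return None
    return "\n".join(lines)

doc = "Home | Login | Share\nProX refines   data.\nIt works per example."
program = 'remove_lines(0, 0)\nnormalize("   ", " ")'
refined = execute(program, doc)
print(refined)
```

In this sketch the first call removes a navigation line and the second collapses a run of spaces, which mirrors the kind of fine-grained, example-specific edit the abstract describes. Parsing the program with `ast` rather than evaluating it directly is a design choice of this sketch, made so that malformed or unexpected model output is rejected instead of executed.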
