RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Abstract

The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily because of the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refinement model that can systematically improve every instance in the corpus at scale. We evaluate RefineX with from-scratch pre-training at multiple model scales and find that models trained on RefineX-refined data consistently outperform those trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and ProX-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
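The idea of distilling an end-to-end refinement result into a deletion-only program can be illustrated with a small sketch. This is not the authors' implementation: the function names `deletion_program` and `apply_program` are hypothetical, the line-level granularity is an assumption, and Python's standard `difflib` stands in for whatever edit-distance procedure the paper actually uses. The sketch compares a raw document with an expert-refined version and keeps the pair only if the refined text is reachable from the raw text by deletions alone.

```python
import difflib
from typing import List, Optional, Tuple


def deletion_program(raw: str, refined: str) -> Optional[List[Tuple[int, int]]]:
    """Derive line ranges to delete from `raw` so that it becomes `refined`.

    Returns a list of half-open (start, end) line ranges, or None when the
    refined text contains inserted or rewritten lines (assumption: such pairs
    are discarded because they cannot be expressed as pure deletions).
    """
    raw_lines = raw.splitlines()
    refined_lines = refined.splitlines()
    matcher = difflib.SequenceMatcher(a=raw_lines, b=refined_lines, autojunk=False)

    program = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "delete":
            program.append((i1, i2))
        else:
            # "insert" or "replace": the refined text adds or alters content.
            return None
    return program


def apply_program(raw: str, program: List[Tuple[int, int]]) -> str:
    """Apply a deletion program to `raw` and return the remaining text."""
    drop = {i for start, end in program for i in range(start, end)}
    kept = [line for i, line in enumerate(raw.splitlines()) if i not in drop]
    return "\n".join(kept)


raw_doc = "Home | Login\nLLMs need clean data.\nShare this page\nQuality matters."
clean_doc = "LLMs need clean data.\nQuality matters."
rewritten_doc = "LLMs require clean data.\nQuality matters."

program = deletion_program(raw_doc, clean_doc)
restored = apply_program(raw_doc, program)
rejected = deletion_program(raw_doc, rewritten_doc)
```

In this toy example the first pair yields a program that removes the navigation and share lines, while the second pair is rejected because the refined text rewrites a sentence instead of only deleting content. Restricting supervision to deletions in this way is one plausible reading of how the raw text's wording is preserved.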


