OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. When combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
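
The scoring and sampling steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a diagonal Adam-style preconditioner as a stand-in for the "optimizer-induced" update, uses a plain dot product as the projection onto the target direction, and omits the Ghost technique and CountSketch entirely. All function names, the `second_moment` vector, and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def adam_style_update(grad, second_moment, eps=1e-8):
    """Assumed effective update: gradient rescaled by a diagonal Adam-like preconditioner."""
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]


def projected_utility(candidate_grads, target_direction, second_moment):
    """Score each candidate by projecting its preconditioned update onto the target direction."""
    scores = []
    for grad in candidate_grads:
        update = adam_style_update(grad, second_moment)
        scores.append(sum(u * t for u, t in zip(update, target_direction)))
    return scores


def boltzmann_probabilities(scores, temperature=1.0):
    """Softmax of scores / temperature, shifted by the maximum for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(scores, k, temperature=1.0, seed=0):
    """Draw k distinct candidate indices, one at a time, with Boltzmann weights."""
    rng = random.Random(seed)
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        probs = boltzmann_probabilities([scores[i] for i in remaining], temperature)
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen


# Toy example: three candidates in a two-parameter model (made-up numbers).
candidate_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target_direction = [1.0, 0.0]   # stands in for the proxy-derived direction
second_moment = [1.0, 4.0]      # stands in for the optimizer state

scores = projected_utility(candidate_grads, target_direction, second_moment)
selected = boltzmann_select(scores, k=2, temperature=0.5)
```

Sampling from a Boltzmann distribution rather than taking the top-scoring candidates keeps some probability mass on lower-scoring examples, which is one way to read the abstract's remark that Boltzmann sampling is used for data diversity. A lower `temperature` moves the sketch closer to greedy top-k selection.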
