MIRA: Mid-Training Rubric Anchoring for Source-Aware Data Selection

Abstract

Mid-training has become an important stage in modern large language model (LLM) development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinctive: the model is trained on the data with a pretraining-style objective at near-pretraining scale, yet the data are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. Effective selection therefore requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well but provide only implicit quality signals. Semantic selection methods offer stronger judgments but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers that filter the full corpus. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
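
As a rough illustration of the final filtering step only, the sketch below keeps the highest-scoring documents of each source group under a per-group token budget. Everything in it is an assumption made for illustration and is not taken from MIRA itself: the `Doc` fields, the `filter_by_group` function, the per-group budget rule, and the toy length-based scorers, which merely stand in for distilled student scorers. Rubric discovery and distillation are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    text: str
    group: str      # source group label (hypothetical field name)
    n_tokens: int   # token count of the document


# A scorer maps document text to a quality score; higher is better.
Scorer = Callable[[str], float]


def filter_by_group(docs: List[Doc], scorers: Dict[str, Scorer],
                    keep_fraction: float = 0.5) -> List[Doc]:
    """Keep the top-scoring docs of each group within a per-group token budget."""
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.group].append(doc)

    kept: List[Doc] = []
    for group, members in by_group.items():
        score = scorers[group]  # each group is judged by its own scorer
        # Assumed budget rule: a fixed fraction of the group's total tokens.
        budget = keep_fraction * sum(d.n_tokens for d in members)
        used = 0
        # Visit documents from highest to lowest score.
        for doc in sorted(members, key=lambda d: score(d.text), reverse=True):
            if used + doc.n_tokens > budget:
                break
            kept.append(doc)
            used += doc.n_tokens
    return kept


# Toy data and toy scorers (placeholders, not learned models).
docs = [
    Doc("def add(a, b): return a + b", "code", 10),
    Doc("x", "code", 10),
    Doc("Q: What is a list? A: An ordered collection.", "qa", 10),
    Doc("Q: ? A: ?", "qa", 10),
]
scorers = {
    "code": lambda t: float(len(t)),
    "qa": lambda t: float(t.count(" ")),
}
kept = filter_by_group(docs, scorers)
```

With `keep_fraction=0.5`, each toy group retains one of its two equally sized documents, so the kept set holds half of the total tokens. Applying the budget per group rather than globally is one possible design choice; it prevents a single group with systematically higher scores from crowding out the others.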