MIRA: Rubric-Anchored, Source-Aware Data Selection for Mid-Training

Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
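The final filtering step described above (scoring every document with a scorer specific to its source group, then keeping roughly half the tokens) can be illustrated with a minimal sketch. This is not the MIRA implementation: the `Doc` type, the `select_by_group` function, the per-group token budget, and the toy scoring functions are all assumptions made for illustration, and the rubric discovery and distillation stages are replaced by plain callables.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Doc(NamedTuple):
    text: str
    source_group: str  # hypothetical group label, e.g. "code" or "qa"
    n_tokens: int


# Hypothetical stand-in for a distilled student scorer: one callable per group.
Scorer = Callable[[Doc], float]


def select_by_group(docs: List[Doc], scorers: Dict[str, Scorer],
                    keep_fraction: float = 0.5) -> List[Doc]:
    """Keep the highest-scoring docs in each group, up to a token budget.

    Assumption: the budget is applied per source group; the abstract only
    states that about half the tokens are used overall.
    """
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    kept: List[Doc] = []
    for group, members in by_group.items():
        budget = keep_fraction * sum(d.n_tokens for d in members)
        # Rank with the scorer that belongs to this group only.
        ranked = sorted(members, key=scorers[group], reverse=True)
        used = 0
        for doc in ranked:
            if used + doc.n_tokens > budget:
                break
            kept.append(doc)
            used += doc.n_tokens
    return kept


# Toy data and toy scorers, purely illustrative.
docs = [
    Doc("def add(a, b): return a + b", "code", 10),
    Doc("x", "code", 10),
    Doc("Q: how to sort? A: use sorted()", "qa", 10),
    Doc("Q: ? A: ?", "qa", 10),
]
scorers = {
    "code": lambda d: float(len(d.text)),
    "qa": lambda d: float(d.text.count("sorted")),
}
selected = select_by_group(docs, scorers)
```

Each group is ranked only against its own scorer, so a score from one source group is never compared with a score from another; that is the sense in which the selection is source-aware in this sketch.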