MIRA: Mid-Training Anchored Rubrics for Source-Aware Data Selection

Abstract

Mid-training has become an important stage in modern large language model (LLM) development: it uses large-scale curated data mixtures to strengthen model capabilities before final post-training. Its data selection problem is distinct. Training uses a pretraining-style objective at near-pretraining scale, yet the data are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. Effective selection therefore requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well but provide only implicit quality signals. Semantic selection methods offer stronger judgments but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. In code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
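
The final filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Doc` type, the `filter_by_group` function, the per-group token budget, and the toy scorers are all hypothetical, and the abstract does not say how the token budget is split across source groups. The sketch assumes only that each source group has its own scorer (standing in for a distilled student scorer) and that the highest-scoring documents in each group are kept until a fixed fraction of that group's tokens is reached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    text: str
    source_group: str  # hypothetical group label, e.g. "code" or "qa"
    n_tokens: int


# Hypothetical stand-in for a distilled student scorer: text -> quality score.
Scorer = Callable[[str], float]


def filter_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_fraction: float = 0.5,
) -> List[Doc]:
    """Keep top-scoring docs per source group within a per-group token budget.

    Assumption: the budget is keep_fraction of each group's own tokens.
    """
    # Bucket documents by source group so each group uses its own scorer.
    groups: Dict[str, List[Doc]] = {}
    for doc in docs:
        groups.setdefault(doc.source_group, []).append(doc)

    kept: List[Doc] = []
    for group, members in groups.items():
        scorer = scorers[group]
        # Rank from highest to lowest score (sorted is stable on ties).
        ranked = sorted(members, key=lambda d: scorer(d.text), reverse=True)
        budget = keep_fraction * sum(d.n_tokens for d in members)
        used = 0
        for doc in ranked:
            # Stop at the first document that would exceed the budget.
            if used + doc.n_tokens > budget:
                break
            kept.append(doc)
            used += doc.n_tokens
    return kept
```

Scoring each group with its own function is what makes the filter source-aware in this sketch: a document from one group is never compared against documents scored under a different group's criteria.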