MIRA: Anchoring Mid-Training Rubrics for Source-Aware Data Selection

Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the model is trained with a pretraining-style objective at near-pretraining scale, but the data are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well but provide only implicit quality signals. Semantic selection methods offer stronger judgments but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
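
The final stage described above, where per-group student scorers filter the full corpus, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Doc` type, the `filter_by_group` function, the per-group token budget, and the toy scorers are all hypothetical names and assumptions introduced here. The abstract does not say how the token budget is split across source groups, so the sketch simply applies the same fraction within each group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    text: str
    source_group: str  # hypothetical group label, e.g. "code" or "qa"
    n_tokens: int


# Hypothetical stand-in for a distilled student scorer: text -> quality score.
Scorer = Callable[[str], float]


def filter_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_token_fraction: float = 0.5,
) -> List[Doc]:
    """Within each source group, keep the highest-scoring documents until
    adding the next one would exceed that group's token budget.

    Assumption: the budget is a fixed fraction of each group's own tokens.
    """
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    kept: List[Doc] = []
    for group, group_docs in by_group.items():
        scorer = scorers[group]  # each group is judged by its own scorer
        budget = keep_token_fraction * sum(d.n_tokens for d in group_docs)
        used = 0
        ranked = sorted(group_docs, key=lambda d: scorer(d.text), reverse=True)
        for doc in ranked:
            if used + doc.n_tokens > budget:
                break  # stop at the first document that would overshoot
            kept.append(doc)
            used += doc.n_tokens
    return kept


# Toy usage: two groups, each scored by a different (made-up) criterion.
toy_docs = [
    Doc("def f(): return 1", "code", 10),
    Doc("x", "code", 10),
    Doc("Q: why? A: because the loop ends.", "qa", 10),
    Doc("Q: ?", "qa", 10),
]
toy_scorers = {
    "code": lambda t: float("def " in t),  # placeholder for a code rubric
    "qa": lambda t: float(len(t)),         # placeholder for a QA rubric
}
kept_docs = filter_by_group(toy_docs, toy_scorers, keep_token_fraction=0.5)
```

Using a separate scorer per group reflects the source-adaptive criteria the abstract argues for: a single global scorer would rank a code file and a question-answer pair on the same scale, even though what counts as useful differs between them.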