Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Abstract

Recent studies on reasoning models explore the meta-awareness of language models, that is, a model's ability to know how to think by itself. We argue that large reasoning models lack this property, and we support the claim by demonstrating severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-predictions with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and show that enhanced meta-awareness directly translates into improved accuracy. Unlike existing meta-cognitive reasoning models, our method requires no external training sources and instead leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are encouraging: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, and achieves a 19.3% gain in accuracy on AIME25 and a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
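
Why zero-variance prompts are worth filtering can be illustrated with a minimal sketch. It assumes the standard GRPO group-normalized advantage, in which each rollout's reward is centered by the group mean and scaled by the group standard deviation. Under that assumption, a prompt whose rollouts all receive the same reward (all correct or all wrong) yields zero advantage for every rollout and contributes no learning signal. The function names, the `eps` constant, and the binary rewards are hypothetical choices for illustration. The sketch inspects rewards after rollouts have been generated; how MASA identifies such prompts is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout in the group received the same reward."""
    return len(set(rewards)) <= 1


def filter_prompts(reward_groups):
    """Keep only prompts whose rollout rewards differ within the group."""
    return {
        prompt: rewards
        for prompt, rewards in reward_groups.items()
        if not is_zero_variance(rewards)
    }


# Hypothetical rewards for four rollouts per prompt (1.0 = correct answer).
groups = {
    "trivial": [1.0, 1.0, 1.0, 1.0],      # always solved
    "unsolvable": [0.0, 0.0, 0.0, 0.0],   # never solved
    "informative": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes
}
kept = filter_prompts(groups)
```

In this toy input only the "informative" prompt survives the filter, since it is the only group whose advantages are nonzero.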
