

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

September 26, 2025
Authors: Yoonjeon Kim, Doohyuk Jang, Eunho Yang
cs.AI

Abstract

Recent studies on reasoning models explore the meta-awareness of language models: the ability to know how to think by themselves. We argue that large reasoning models lack this meta-awareness property by demonstrating severe misalignment between true rollouts and predicted meta-information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and show that enhanced meta-awareness translates directly into improved accuracy. Unlike existing meta-cognitive reasoning models, our method requires no external training sources; instead, it leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable, and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are encouraging: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and generalizes strongly to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, and achieves a 19.3% accuracy gain on AIME25 and a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
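To make the two efficiency mechanisms in the abstract concrete, here is a minimal Python sketch of (i) zero-variance prompt filtering and (ii) length-based rollout cutoff. This is an illustration under my own assumptions, not the paper's implementation: `rollout`, `verify`, `stream_tokens`, and the threshold values are hypothetical stand-ins.

```python
# Hedged sketch of the two training-efficiency filters described in the
# abstract. `rollout`, `verify`, and `stream_tokens` are hypothetical
# helpers, not APIs from the paper.
from statistics import pvariance


def filter_zero_variance(prompts, model, n_rollouts=8):
    """Keep only prompts whose rollout rewards vary across samples.

    A prompt where every rollout is correct (trivial) or every rollout is
    wrong (unsolvable) has zero reward variance, so a group-relative
    objective like GRPO gets no learning signal from it.
    """
    kept = []
    for prompt in prompts:
        rewards = [verify(rollout(model, prompt)) for _ in range(n_rollouts)]
        if pvariance(rewards) > 0.0:  # mixed outcomes -> useful gradient
            kept.append(prompt)
    return kept


def generate_with_cutoff(model, prompt, predicted_len, slack=1.5):
    """Stop a rollout once it far exceeds an expected length budget,
    on the view that overlong rollouts rarely end in correct answers."""
    budget = int(predicted_len * slack)
    tokens = []
    for token in stream_tokens(model, prompt):  # hypothetical streaming API
        tokens.append(token)
        if len(tokens) > budget:
            break  # unlikely to reach a correct answer; cut off early
    return tokens
```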
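The core self-alignment idea, rewarding the model when its meta-prediction matches its own true rollouts, could take a form like the sketch below. This is speculative: the paper does not specify what the meta-information contains, so predicted solvability and predicted length are assumed examples, and all names and weights are illustrative.

```python
# Speculative sketch of a self-alignment reward: score how well a
# pre-rollout meta-prediction matched the rollout that actually happened.
# The choice of meta-fields (solvability, length) is an assumption.
def self_alignment_reward(pred_solvable: bool, pred_len: int,
                          rollout_correct: bool, rollout_len: int,
                          len_tol: float = 0.25) -> float:
    """Return a bonus in [0, 1] for accurate meta-prediction."""
    solvable_match = float(pred_solvable == rollout_correct)
    # Count the length prediction as correct if it lands within a
    # relative tolerance of the observed rollout length.
    len_match = float(abs(pred_len - rollout_len)
                      <= len_tol * max(rollout_len, 1))
    return 0.5 * solvable_match + 0.5 * len_match
```

Such a bonus, added to the task reward, would train meta-awareness purely from self-generated signals, consistent with the abstract's claim that no external training sources are needed.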