Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

February 10, 2026
Authors: Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, Feng Zhao
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks the mechanisms for error attribution and experience internalization that are intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and the formation of reusable knowledge. We refer to such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing its negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements across multiple benchmarks, yielding 3.92%–4.73% Pass@1 gains across varying model sizes.
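
To make the internalization step concrete, here is a minimal sketch, assuming a Hugging Face causal LM, of how a distilled meta-experience could be written into parametric memory by minimizing its negative log-likelihood. This is not the authors' implementation: the checkpoint name and the `prompt` and `meta_experience` strings are illustrative placeholders, and in MEL the meta-experience would be self-distilled by the model from paired correct and incorrect trajectories rather than hand-written.

```python
# Minimal sketch of the NLL-based internalization step described in the abstract.
# Assumptions (not from the paper): the checkpoint, prompt, and meta-experience
# text below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# A meta-experience summarizing where an incorrect trajectory diverged from a
# correct one (in MEL this text would be self-distilled via contrastive analysis).
prompt = "Problem: compute the sum of the first 100 positive integers.\n"
meta_experience = (
    "Lesson: when applying the formula n(n+1)/2, substitute n = 100 before "
    "simplifying; the earlier error came from dropping the +1 term."
)

# Tokenize prompt + meta-experience; only meta-experience tokens are scored,
# so the prompt acts purely as conditioning context.
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + meta_experience, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, :prompt_len] = -100  # -100 masks these positions from the loss

# One internalization step: gradient descent on -log p_theta(meta_experience | prompt).
model.train()
nll_loss = model(input_ids=full_ids, labels=labels).loss
nll_loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"meta-experience NLL: {nll_loss.item():.4f}")
```

In the full framework this NLL term is presumably optimized jointly with the RLVR objective during training rather than as a standalone fine-tuning pass, so that the language-modeled signal complements the verifiable reward.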