CORE: Contrastive Reflection Enables Rapid Improvement in Reasoning

Abstract

Language models can use verifiable rewards to improve on a wide variety of reasoning tasks. However, both parametric (e.g., RLVR) and non-parametric (e.g., prompt optimization) approaches typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE achieves performance gains comparable to or greater than those of each baseline. Finally, we show that CORE is substantially more context-efficient than the non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
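To make the idea of contrasting successful and unsuccessful attempts concrete, the following is a minimal sketch in Python, not the CORE algorithm itself. Everything in it is an assumption made for illustration: the names `Trace`, `contrastive_pairs`, `distill_insights`, and `prompt_with_insights` are hypothetical, the pairing rule (first success against first failure on the same problem), the reflection prompt wording, and the exact-match deduplication are placeholders, and `reflect` stands in for whatever language-model call produces an insight.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Trace:
    """One attempt at a problem (hypothetical structure)."""
    problem_id: str
    reasoning: str
    correct: bool  # outcome of the verifiable reward


def contrastive_pairs(traces: list[Trace]) -> Iterator[tuple[Trace, Trace]]:
    """Yield one (successful, failed) pair per problem that has both outcomes."""
    by_problem = defaultdict(list)
    for trace in traces:
        by_problem[trace.problem_id].append(trace)
    for attempts in by_problem.values():
        wins = [t for t in attempts if t.correct]
        losses = [t for t in attempts if not t.correct]
        if wins and losses:
            # Assumed pairing rule: first success vs. first failure.
            yield wins[0], losses[0]


def distill_insights(
    traces: list[Trace],
    reflect: Callable[[str], str],  # stand-in for a language-model call
    max_insights: int = 5,
) -> list[str]:
    """Ask `reflect` what separates each success from its paired failure."""
    insights: list[str] = []
    for good, bad in contrastive_pairs(traces):
        prompt = (
            "Successful attempt:\n" + good.reasoning + "\n\n"
            "Failed attempt:\n" + bad.reasoning + "\n\n"
            "In one sentence, state the strategy or constraint that "
            "explains the difference."
        )
        insight = reflect(prompt).strip()
        if insight and insight not in insights:  # naive deduplication
            insights.append(insight)
        if len(insights) >= max_insights:
            break
    return insights


def prompt_with_insights(question: str, insights: list[str]) -> str:
    """Prepend stored insights to a new problem; no weights are updated."""
    if not insights:
        return question
    bullets = "\n".join(f"- {text}" for text in insights)
    return f"Insights from earlier attempts:\n{bullets}\n\nProblem: {question}"
```

In this sketch the learned knowledge is just a short list of strings, which is why the approach is non-parametric: later prompts carry the insights rather than the full stored traces. Problems with only successes or only failures contribute no pair under the assumed pairing rule.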