CORE: Contrastive Reflection Enables Rapid Improvement in Reasoning

Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
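
The abstract describes the contrastive step only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a binary verifiable reward, pairs successful and unsuccessful traces on the same problem, and turns each pair into a short insight that is prepended to later prompts. All names here (`Trace`, `contrastive_pairs`, `reflect`, `build_prompt`, `toy_summarize`) are hypothetical, and `toy_summarize` is a stand-in for what would presumably be a language-model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    problem_id: str
    reasoning: str
    reward: float  # assumption: 1.0 for a verified-correct attempt, 0.0 otherwise


def contrastive_pairs(traces: List[Trace]) -> List[Tuple[Trace, Trace]]:
    """Pair each successful trace with each unsuccessful trace on the same problem."""
    by_problem: Dict[str, List[Trace]] = {}
    for t in traces:
        by_problem.setdefault(t.problem_id, []).append(t)
    pairs: List[Tuple[Trace, Trace]] = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward > 0]
        losses = [t for t in group if t.reward <= 0]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflect(
    pairs: List[Tuple[Trace, Trace]],
    summarize: Callable[[str, str], str],
) -> List[str]:
    """Turn each (success, failure) pair into one insight, dropping duplicates."""
    insights: List[str] = []
    for win, loss in pairs:
        insight = summarize(win.reasoning, loss.reasoning)
        if insight not in insights:
            insights.append(insight)
    return insights


def build_prompt(question: str, insights: List[str]) -> str:
    """Prepend the stored insights to a new question."""
    header = "\n".join(f"- {i}" for i in insights)
    return f"Insights from earlier attempts:\n{header}\n\nQuestion: {question}"


# Hypothetical stand-in for a model call that contrasts two traces.
def toy_summarize(good: str, bad: str) -> str:
    return f"Prefer '{good}' over '{bad}'."


traces = [
    Trace("p1", "check units before dividing", 1.0),
    Trace("p1", "divide immediately", 0.0),
    Trace("p2", "guess the answer", 0.0),  # no successful attempt, so no contrast
]
insights = reflect(contrastive_pairs(traces), toy_summarize)
prompt = build_prompt("What is 3 km / 2 h in m/s?", insights)
```

In this sketch only the insights are stored, not the traces themselves, which is one way to read the abstract's claim of needing fewer prompt tokens than methods that reuse stored reasoning traces directly.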