CORE: Contrastive Reflection Enables Rapid Improvement in Reasoning

Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g., RLVR) and non-parametric (e.g., prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
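
The abstract does not spell out CORE's procedure, so the following is only a minimal sketch of the general idea it describes: pair successful and unsuccessful traces for the same problem, turn each contrast into a short natural-language insight, and prepend the insights to later prompts. Every name here (`Trace`, `contrastive_pairs`, `reflect`, `build_prompt`, `toy_reflector`) is hypothetical, the 0.5 reward threshold and the cap on insights are assumptions, and `toy_reflector` is a stand-in for what would presumably be a language-model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    """One past attempt at a problem (hypothetical structure)."""
    problem: str
    reasoning: str
    reward: float  # verifiable reward; assumed 1.0 = correct, 0.0 = incorrect


def contrastive_pairs(traces: List[Trace]) -> List[Tuple[Trace, Trace]]:
    """Pair every successful trace with every failed trace on the same problem."""
    by_problem: Dict[str, List[Trace]] = {}
    for t in traces:
        by_problem.setdefault(t.problem, []).append(t)
    pairs = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward > 0.5]     # assumed threshold
        losses = [t for t in group if t.reward <= 0.5]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflect(
    traces: List[Trace],
    reflector: Callable[[Trace, Trace], str],
    max_insights: int = 5,  # assumed cap to keep the prompt compact
) -> List[str]:
    """Distill each (success, failure) contrast into a short, deduplicated insight."""
    insights: List[str] = []
    for win, loss in contrastive_pairs(traces):
        insight = reflector(win, loss)
        if insight and insight not in insights:
            insights.append(insight)
    return insights[:max_insights]


def build_prompt(problem: str, insights: List[str]) -> str:
    """Prepend the stored insights to a new problem; no weights are updated."""
    lines = ["Insights from earlier attempts:"]
    lines += [f"- {i}" for i in insights]
    lines += ["", f"Problem: {problem}"]
    return "\n".join(lines)


def toy_reflector(win: Trace, loss: Trace) -> str:
    """Placeholder for a model call that would describe the contrast."""
    return f"Prefer the approach in '{win.reasoning}' over '{loss.reasoning}'."
```

In this sketch, problems with only successes or only failures yield no pairs and therefore no insights. Whether CORE treats such cases the same way is not stated in the abstract.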