CORE: Rapid Improvement of Reasoning through Contrastive Reflection

Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
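
The idea of contrasting successful and unsuccessful attempts can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how traces are paired, how the reflection prompt is worded, or how insights are stored. The names `Trace`, `contrastive_pairs`, `reflect`, and `build_prompt`, the 0.5 reward threshold, and the `llm` callable are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    """One past attempt at a problem (hypothetical structure)."""
    problem_id: str
    reasoning: str
    reward: float  # verifiable reward; assumed to be 1.0 when the answer is correct


def contrastive_pairs(
    traces: List[Trace], threshold: float = 0.5
) -> List[Tuple[Trace, Trace]]:
    """Pair every successful trace with every failed trace on the same problem."""
    by_problem: Dict[str, List[Trace]] = defaultdict(list)
    for trace in traces:
        by_problem[trace.problem_id].append(trace)

    pairs: List[Tuple[Trace, Trace]] = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward >= threshold]
        losses = [t for t in group if t.reward < threshold]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflect(
    pairs: List[Tuple[Trace, Trace]],
    llm: Callable[[str], str],
    insights: List[str],
) -> List[str]:
    """Ask the model for one short insight per pair; skip empty or repeated ones."""
    updated = list(insights)
    for win, loss in pairs:
        # Assumed prompt wording; the paper's actual prompt is not given in the abstract.
        prompt = (
            "Compare the two attempts at the same problem.\n"
            f"Successful attempt:\n{win.reasoning}\n\n"
            f"Unsuccessful attempt:\n{loss.reasoning}\n\n"
            "State in one sentence the strategy or constraint that explains the difference."
        )
        insight = llm(prompt).strip()
        if insight and insight not in updated:
            updated.append(insight)
    return updated


def build_prompt(question: str, insights: List[str]) -> str:
    """Prepend stored insights to a new question; no weights are updated."""
    if not insights:
        return question
    bullet_list = "\n".join(f"- {text}" for text in insights)
    return f"Insights from earlier attempts:\n{bullet_list}\n\nProblem: {question}"
```

In this sketch the learned knowledge lives entirely in the `insights` list, which is why such an approach is non-parametric: `llm` can be any function that maps a prompt string to a completion string, and only the prompt built by `build_prompt` changes between rounds.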