From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Abstract

Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT (Chain-of-Thought) prompting than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when the demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as identical to the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to the target question. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
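The sequential structure described in the abstract (demonstrations to insights, insights to a target-specific trace, optional self-refinement) might be sketched roughly as follows. This is only an illustrative outline and not the authors' implementation: the `LLM` callable, the function names, the prompt wording, and the `refine_rounds` parameter are all hypothetical. Passing only the distilled insights (not the raw demonstrations) to the solving stage is likewise an assumption, motivated by the verbatim-copying failure mode noted above.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]
Demo = Tuple[str, str]  # (question, reasoning trace)


def extract_insights(llm: LLM, demos: List[Demo]) -> str:
    """Stage 1: distill demonstrations into explicit, reusable insights."""
    shown = "\n\n".join(
        f"Example {i}\nQuestion: {q}\nReasoning: {r}"
        for i, (q, r) in enumerate(demos, start=1)
    )
    # Prompt wording is an assumption, not taken from the paper.
    prompt = (
        "Summarize the general strategies used in these worked examples "
        "as short, problem-independent insights.\n\n" + shown
    )
    return llm(prompt)


def solve_with_insights(llm: LLM, insights: str, target: str) -> str:
    """Stage 2: derive a reasoning trace specific to the target question."""
    # Only the insights are passed on (an assumption of this sketch), so the
    # raw exemplar steps are not available to be copied verbatim.
    prompt = (
        f"Insights:\n{insights}\n\n"
        f"Question: {target}\n"
        "Apply the relevant insights and reason step by step to an answer."
    )
    return llm(prompt)


def refine(llm: LLM, target: str, trace: str) -> str:
    """Optional stage (I2S+): self-refine for coherence and correctness."""
    prompt = (
        f"Question: {target}\n\nDraft reasoning:\n{trace}\n\n"
        "Check the draft for errors or gaps and return a corrected version."
    )
    return llm(prompt)


def i2s(llm: LLM, demos: List[Demo], target: str, refine_rounds: int = 0) -> str:
    """Run the stages in order; refine_rounds > 0 corresponds to I2S+."""
    insights = extract_insights(llm, demos)
    trace = solve_with_insights(llm, insights, target)
    for _ in range(refine_rounds):
        trace = refine(llm, target, trace)
    return trace
```

With any prompt-to-text function `my_llm`, a call such as `i2s(my_llm, demos, question, refine_rounds=1)` would issue three model calls in sequence: one for insights, one for solving, and one for refinement.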
