

From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

September 27, 2025
作者: Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi
cs.AI

Abstract

Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via the insight-refine-solve framework.
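
The abstract describes a three-stage test-time procedure (distill insights from demos, solve the target with them, optionally self-refine). The sketch below is only a rough illustration of that idea, not the paper's implementation: the i2s_solve helper, its generate parameter, and all prompt wording are assumptions made for this example, and any text-in/text-out model call can be plugged in.

```python
from typing import Callable, Sequence

def i2s_solve(
    question: str,
    demos: Sequence[str],
    generate: Callable[[str], str],
    refine: bool = False,
) -> str:
    """Illustrative insight -> solve -> (optional) refine pipeline.

    `generate` is any text-in/text-out LLM call; the prompts are
    placeholders, not the paper's actual templates.
    """
    # Stage 1: distill the demonstrations into explicit, reusable insights
    # rather than leaving them as raw traces that invite verbatim copying.
    insight_prompt = (
        "Read the worked examples below and summarize the general reasoning "
        "strategies they use, without repeating their specific numbers.\n\n"
        + "\n\n".join(demos)
    )
    insights = generate(insight_prompt)

    # Stage 2: solve the target question guided by the distilled insights,
    # producing a target-specific reasoning trace.
    solve_prompt = (
        f"Reusable strategies:\n{insights}\n\n"
        f"Apply whichever strategies are relevant to solve:\n{question}\n"
        "Reason step by step, then state the final answer."
    )
    solution = generate(solve_prompt)

    # Stage 3 (I2S+): optionally have the model check and revise its own
    # trace for coherence and correctness before committing to an answer.
    if refine:
        refine_prompt = (
            f"Question:\n{question}\n\nDraft reasoning:\n{solution}\n\n"
            "Check the reasoning for errors or gaps, fix any you find, "
            "and restate the final answer."
        )
        solution = generate(refine_prompt)

    return solution
```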