
Learning from Peers in Reasoning Models

May 12, 2025
作者: Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang
cs.AI

Abstract

Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
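The summarize-and-share loop the abstract describes can be sketched schematically. Everything below is a toy stand-in, not the paper's implementation: the decoder, summarizer, and router are hypothetical placeholder functions, paths are lists of token strings, the sharing interval `T` is an assumed value, and the "router" simply broadcasts each path's summary to every other path.

```python
# Toy sketch of the LeaP loop: several reasoning paths decode in parallel,
# and every T tokens each path summarizes itself and receives peer summaries.
# All names here (generate_token, summarize, route, leap) are illustrative.

T = 4  # share peer summaries every T generated tokens (assumed interval)

def generate_token(path_id, step):
    """Toy decoder: produce one token string for a reasoning path."""
    return f"p{path_id}_tok{step}"

def summarize(tokens):
    """Toy summarizer: use the last token as the 'intermediate summary'."""
    return tokens[-1]

def route(summaries, target):
    """Toy router: a path receives the summaries of all other paths."""
    return [s for i, s in enumerate(summaries) if i != target]

def leap(num_paths=3, num_steps=8):
    paths = [[] for _ in range(num_paths)]
    for step in range(num_steps):
        # Each path decodes one more token.
        for p in range(num_paths):
            paths[p].append(generate_token(p, step))
        # Every T tokens: summarize each path and exchange via the router.
        if (step + 1) % T == 0:
            summaries = [summarize(toks) for toks in paths]
            for p in range(num_paths):
                # Peer summaries are injected back into the path's context,
                # so later decoding can condition on peer insights.
                paths[p].extend(f"<peer:{s}>" for s in route(summaries, p))
    return paths

paths = leap()
```

In a real system the peer summaries would be appended to each path's prompt before decoding resumes; the point of the sketch is only the periodic summarize → route → inject cycle.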
