Learning from Peers in Reasoning Models

May 12, 2025
作者: Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang
cs.AI

Abstract

Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, each reasoning path periodically summarizes its intermediate reasoning and shares the summary with other paths through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals that LeaP achieves robust error correction through timely peer insights, exhibiting strong error tolerance and adaptability across task difficulties. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
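The summarize-route-share loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names (`generate`, `summarize`, `route`) and the fixed token budget are hypothetical placeholders, not the paper's actual prompting or routing implementation.

```python
# Hypothetical sketch of one LeaP round: each parallel reasoning path
# advances by a fixed token budget, summarizes its intermediate reasoning,
# and receives peer summaries selected by a routing function.

def leap_round(paths, generate, summarize, route, chunk_tokens=256):
    """Advance each reasoning path by one chunk, then exchange peer summaries.

    paths:      list of token strings, one per parallel reasoning path
    generate:   fn(path, n_tokens) -> path continued by up to n_tokens
    summarize:  fn(path) -> short summary of its intermediate reasoning
    route:      fn(summaries, i) -> peer summaries selected for path i
    """
    # 1. Each path reasons independently for a fixed token budget.
    paths = [generate(p, chunk_tokens) for p in paths]

    # 2. Each path summarizes its intermediate reasoning so far.
    summaries = [summarize(p) for p in paths]

    # 3. A routing mechanism picks which peer summaries each path sees;
    #    they are appended so the next chunk can reflect on peer insights.
    new_paths = []
    for i, p in enumerate(paths):
        peer_block = "\n".join(route(summaries, i))
        new_paths.append(p + "\n[Peer insights]\n" + peer_block)
    return new_paths
```

Repeating this round until the paths terminate yields the collaborative inference loop; the choice of `route` (e.g. sharing with all peers versus a dissimilarity-based selection) is where the routing designs studied in the paper would plug in.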
