Learning from Peers in Reasoning Models

Abstract

Large Reasoning Models (LRMs) can self-correct even after making mistakes in their reasoning paths. However, our study reveals that when a reasoning process begins with a short but poor start, the model struggles to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with the other paths through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on the three math benchmarks by an average of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis shows that LeaP corrects errors robustly through timely peer insights, tolerates errors well, and handles tasks of varied difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
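The summarize-and-share loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `summarize`, `route`, `run_leap`, the `<peers>` marker, and the parameters `interval` and `k` are hypothetical names introduced here, the "summary" is just the last few tokens of a path rather than a model-written summary, and the router simply takes the first `k` peers rather than applying any particular selection rule.

```python
from typing import Callable, List

PEER_TAG = "<peers>"  # hypothetical marker for injected peer summaries


def summarize(path: List[str], max_tokens: int = 8) -> str:
    """Stand-in for a model-written summary: the path's own last few tokens."""
    own = [tok for tok in path if not tok.startswith(PEER_TAG)]
    return " ".join(own[-max_tokens:])


def route(summaries: List[str], me: int, k: int) -> List[str]:
    """Hypothetical router: up to k peer summaries, never the path's own."""
    peers = [s for i, s in enumerate(summaries) if i != me]
    return peers[:k]


def run_leap(
    num_paths: int,
    total_tokens: int,
    interval: int,
    k: int,
    next_token: Callable[[int, List[str]], str],
) -> List[List[str]]:
    """Generate paths in parallel; share summaries at a fixed token interval."""
    paths: List[List[str]] = [[] for _ in range(num_paths)]
    for step in range(1, total_tokens + 1):
        # Each path produces one more token of its own reasoning.
        for i, path in enumerate(paths):
            path.append(next_token(i, path))
        if step % interval == 0:
            # Collect all summaries first, so every path sees the same round.
            summaries = [summarize(p) for p in paths]
            for i, path in enumerate(paths):
                peers = route(summaries, i, k)
                path.append(PEER_TAG + " | ".join(peers) + "</peers>")
    return paths


if __name__ == "__main__":
    # Dummy generator: labels each token with its path index and position.
    demo = run_leap(3, 4, 2, 2, lambda i, p: f"p{i}t{len(p)}")
    for path in demo:
        print(path)
```

In a real system, `next_token` would be a decoding step of the reasoning model and the routed summaries would be inserted into each path's context so that later tokens can reflect on them.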

