Learning from Peers in Reasoning Models

Abstract

Large Reasoning Models (LRMs) can self-correct even after making mistakes in their reasoning paths. However, our study reveals that when a reasoning process begins with a short but poor start, the model struggles to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without harming individuals who are already accurate, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with the other paths through a routing mechanism, so that paths can incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow the summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis shows that timely peer insights give LeaP robust error correction, with strong error tolerance and the ability to handle tasks of varied difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
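The summarize-and-share loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `route`, `leap_round`, `run`, and `interval_t` are hypothetical, each "model" is a plain function that emits one token per step, and the routing rule (send each path the peer summaries that overlap least with its own, by word-level Jaccard similarity) is an assumption chosen only to make the example concrete.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two summaries."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def route(own: int, summaries: List[str], top_k: int) -> List[str]:
    """Hypothetical router: return the top_k peer summaries least similar to our own."""
    peers = [(jaccard(summaries[own], s), i)
             for i, s in enumerate(summaries) if i != own]
    peers.sort()  # ascending similarity; ties are broken by path index
    return [summaries[i] for _, i in peers[:top_k]]


def leap_round(paths: List[List[str]],
               summarize: Callable[[List[str]], str],
               top_k: int) -> List[List[str]]:
    """One communication round: every path summarizes, then receives peer summaries."""
    summaries = [summarize(p) for p in paths]
    new_paths = []
    for i, path in enumerate(paths):
        received = route(i, summaries, top_k)
        # Peer summaries are appended to the path's context, tagged so they
        # can be told apart from the path's own tokens.
        new_paths.append(path + [f"[peer] {s}" for s in received])
    return new_paths


def run(step_fns: List[Callable[[int], str]],
        interval_t: int,
        total_tokens: int,
        summarize: Callable[[List[str]], str],
        top_k: int = 1) -> List[List[str]]:
    """Advance all paths in lockstep and trigger a round every interval_t tokens."""
    paths: List[List[str]] = [[] for _ in step_fns]
    for t in range(1, total_tokens + 1):
        for path, step in zip(paths, step_fns):
            path.append(step(t))  # stand-in for decoding one token
        if t % interval_t == 0:
            paths = leap_round(paths, summarize, top_k)
    return paths
```

In a real system, `step` would be a decoding call to the reasoning model and `summarize` would be a prompted summary of the path so far; here both are placeholders so the control flow can be run and inspected in isolation.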
