
Small Models Struggle to Learn from Strong Reasoners

February 17, 2025
Authors: Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
cs.AI

Abstract

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or from distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples, or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small-model reasoning performance compared to training on either type of data alone. These findings highlight the limitations of direct strong-model distillation and underscore the importance of adapting reasoning complexity for effective transfer of reasoning capabilities.
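The abstract describes Mix Distillation only at a high level. The sketch below is a minimal illustration of the data-mixing idea, assuming a simple (prompt, reasoning-trace) pair format and a tunable long-CoT ratio; the function name, the 50/50 default ratio, and the toy examples are illustrative assumptions, not the paper's actual implementation.

```python
import random

def mix_distillation_data(long_cot, short_cot, long_ratio=0.5, seed=0):
    """Illustrative sketch (assumed API, not the paper's code): build a
    fine-tuning set that balances reasoning complexity by mixing long-CoT
    demonstrations (e.g., traces from a strong reasoner) with short-CoT
    demonstrations (e.g., traces from a smaller model) at a chosen ratio."""
    rng = random.Random(seed)
    # Budget the mixed set by the smaller pool so both ratios are feasible.
    n = min(len(long_cot), len(short_cot))
    n_long = int(n * long_ratio)
    n_short = n - n_long
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)  # interleave so fine-tuning sees both styles throughout
    return mixed

# Toy usage: each example is a (prompt, reasoning trace) pair.
long_cot = [(f"Q{i}", "step 1 ... step 8 -> answer") for i in range(100)]
short_cot = [(f"Q{i}", "step 1, step 2 -> answer") for i in range(100)]
train_set = mix_distillation_data(long_cot, short_cot, long_ratio=0.5)
```

The mixing ratio is the key knob: too many long-CoT examples reintroduce the learnability gap for ≤3B models, while too few forfeit the strong reasoner's signal.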
