Small Models Struggle to Learn from Strong Reasoners
February 17, 2025
Authors: Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
cs.AI
Abstract
Large language models (LLMs) excel in complex reasoning tasks, and distilling
their reasoning capabilities into smaller models has shown promise. However, we
uncover an interesting phenomenon, which we term the Small Model Learnability
Gap: small models (≤3B parameters) do not consistently benefit from long
chain-of-thought (CoT) reasoning or distillation from larger models. Instead,
they perform better when fine-tuned on shorter, simpler reasoning chains that
better align with their intrinsic learning capacity. To address this, we
propose Mix Distillation, a simple yet effective strategy that balances
reasoning complexity by combining long and short CoT examples or reasoning from
both larger and smaller models. Our experiments demonstrate that Mix
Distillation significantly improves small model reasoning performance compared
to training on either data alone. These findings highlight the limitations of
direct strong model distillation and underscore the importance of adapting
reasoning complexity for effective reasoning capability transfer.
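The abstract specifies Mix Distillation only at a high level: blend long-CoT and short-CoT training examples (or outputs from larger and smaller teacher models) so the data better matches a small student's learning capacity. Below is a minimal sketch of one plausible data-mixing step; the function name mix_distillation, the long_fraction parameter, and the sampling-with-replacement scheme are illustrative assumptions, since the abstract does not state the actual mixing ratio or procedure.

import random

def mix_distillation(long_cot, short_cot, long_fraction=0.25, n_samples=1000, seed=0):
    # Blend long-CoT examples (e.g., distilled from a strong reasoner) with
    # short-CoT examples (shorter chains, or a weaker/smaller teacher) at a
    # fixed ratio. long_fraction is a hypothetical knob; the paper's actual
    # ratio is not given in the abstract.
    rng = random.Random(seed)
    n_long = round(long_fraction * n_samples)
    mixed = rng.choices(long_cot, k=n_long)                  # sample with replacement
    mixed += rng.choices(short_cot, k=n_samples - n_long)
    rng.shuffle(mixed)                                       # interleave the two sources
    return mixed

# Illustrative usage: each example is a prompt/response pair for supervised fine-tuning.
long_pool = [{"prompt": "2 + 3 * 4 = ?",
              "response": "First compute 3 * 4 = 12, then 2 + 12 = 14. Answer: 14."}]
short_pool = [{"prompt": "2 + 3 * 4 = ?",
               "response": "Answer: 14."}]
train_set = mix_distillation(long_pool, short_pool, long_fraction=0.25, n_samples=8)

Fine-tuning the small model on a mixed set like train_set, rather than on either data pool alone, is the intervention the abstract reports as improving small-model reasoning performance.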