Small Models Struggle to Learn from Strong Reasoners

Abstract

Large language models (LLMs) excel at complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or from distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples, or by combining reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small-model reasoning performance compared to training on either type of data alone. These findings highlight the limitations of distilling directly from strong models and underscore the importance of adapting reasoning complexity for effective transfer of reasoning capability.
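The data-mixing idea behind Mix Distillation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_distillation`, the `long_ratio` parameter, and its default value are all hypothetical, since the abstract does not state a mixing ratio or a sampling procedure. The sketch only shows one plausible way to combine long-CoT and short-CoT examples into a single fine-tuning set.

```python
import random


def mix_distillation(long_cot, short_cot, long_ratio=0.2, size=None, seed=0):
    """Build a mixed fine-tuning set from long-CoT and short-CoT examples.

    ASSUMPTION: `long_ratio` (fraction of long-CoT examples) is a hypothetical
    knob; the abstract does not specify any particular ratio.
    """
    if not 0.0 <= long_ratio <= 1.0:
        raise ValueError("long_ratio must be in [0, 1]")
    if size is None:
        # Default to the size of the smaller pool.
        size = min(len(long_cot), len(short_cot))

    n_long = round(size * long_ratio)
    n_short = size - n_long
    if n_long > len(long_cot) or n_short > len(short_cot):
        raise ValueError("not enough examples for the requested mix")

    rng = random.Random(seed)  # local RNG so the mix is reproducible
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed
```

The same function would apply to the other variant mentioned in the abstract by passing examples generated by a larger model as `long_cot` and examples generated by a smaller model as `short_cot`; the argument names are only labels for the two pools.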
