LIMO: Less is More for Reasoning
February 5, 2025
Authors: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
cs.AI
Abstract
We present a fundamental discovery that challenges our understanding of how
complex reasoning emerges in large language models. While conventional wisdom
suggests that sophisticated reasoning tasks demand extensive training data
(>100,000 examples), we demonstrate that complex mathematical reasoning
abilities can be effectively elicited with surprisingly few examples. Through
comprehensive experiments, our proposed model LIMO demonstrates unprecedented
performance in mathematical reasoning. With merely 817 curated training
samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from
previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of
the training data required by previous approaches. LIMO demonstrates
exceptional out-of-distribution generalization, achieving 40.5% absolute
improvement across 10 diverse benchmarks, outperforming models trained on 100x
more data, challenging the notion that SFT leads to memorization rather than
generalization. Based on these results, we propose the Less-Is-More Reasoning
Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has
been comprehensively encoded during pre-training, sophisticated reasoning
capabilities can emerge through minimal but precisely orchestrated
demonstrations of cognitive processes. This hypothesis posits that the
elicitation threshold for complex reasoning is determined by two key factors:
(1) the completeness of the model's encoded knowledge foundation during
pre-training, and (2) the effectiveness of post-training examples as "cognitive
templates" that show the model how to utilize its knowledge base to solve
complex reasoning tasks. To facilitate reproducibility and future research in
data-efficient reasoning, we release LIMO as a comprehensive open-source suite
at https://github.com/GAIR-NLP/LIMO.
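
The recipe the abstract describes is standard supervised fine-tuning, applied to a very small, carefully curated set of long-form reasoning traces. As a rough illustration only, the sketch below shows what such a run could look like with Hugging Face transformers; the base-model id, dataset id, column names, and hyperparameters are all assumptions for the sketch, not the authors' released configuration (see the repository above for the actual code and data).

    # A minimal SFT sketch, NOT the authors' released training code. The model
    # id, dataset id, column names, and hyperparameters are illustrative
    # assumptions made for this example.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    BASE_MODEL = "Qwen/Qwen2.5-0.5B"  # small stand-in; LIMO fine-tunes a much larger base
    DATASET_ID = "GAIR/LIMO"          # assumed Hub id for the 817-example set

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Join each problem with its full reasoning trace into one training string;
    # the field names "question" and "solution" are assumptions about the schema.
    def to_tokens(example):
        return tokenizer(
            example["question"] + "\n" + example["solution"],
            truncation=True,
            max_length=4096,
        )

    dataset = load_dataset(DATASET_ID, split="train")
    dataset = dataset.map(to_tokens, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="limo-sft",
            num_train_epochs=3,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            learning_rate=1e-5,
            bf16=True,  # assumes bf16-capable hardware
        ),
        train_dataset=dataset,
        # Causal-LM collator: labels are the input ids, shifted inside the model.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

Note that nothing in this loop is unusual: on the paper's hypothesis, the leverage comes from the pre-trained knowledge base and from curating the 817 "cognitive template" examples, not from the training procedure itself.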