LIMO: Less is More for Reasoning

Abstract

We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, up from 6.5% and 59.2% respectively for previous SFT-based models, while using only 1% of the training data required by previous approaches. LIMO also demonstrates exceptional out-of-distribution generalization: it achieves a 40.5% absolute improvement across 10 diverse benchmarks and outperforms models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): in foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the knowledge foundation encoded in the model during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to use its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.
