LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!
February 11, 2025
Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
cs.AI
Abstract
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking,
and self-validation. However, the training techniques and data requirements to
elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through
data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank
adaptation (LoRA). With just 17k Long CoT training samples, the
Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of
math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0%
(+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%, respectively. More importantly, we find that the structure of Long
CoT is critical to the learning process, whereas the content of individual
reasoning steps has minimal impact. Perturbations affecting content, such as
training on incorrect samples or removing reasoning keywords, have little
impact on performance. In contrast, structural modifications that disrupt
logical consistency in the Long CoT, such as shuffling or deleting reasoning
steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves accuracy only 3.2% lower than one trained on fully correct samples. These insights deepen our understanding
of how to elicit reasoning capabilities in LLMs and highlight key
considerations for efficiently training the next generation of reasoning
models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.
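The abstract attributes the data efficiency to supervised fine-tuning combined with low-rank adaptation (LoRA). Below is a minimal sketch of how such a setup is commonly wired with the Hugging Face peft library; the rank, alpha, and target modules are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal LoRA SFT setup sketch (assumed hyperparameters, not the paper's).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"  # base model named in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze the 32B base weights and train only small low-rank adapter matrices.
lora_config = LoraConfig(
    r=16,           # adapter rank -- assumed value
    lora_alpha=32,  # scaling factor -- assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, standard SFT applies: tokenize the ~17k Long CoT traces and
# train with an ordinary causal-LM loss (e.g. via transformers.Trainer).
```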
AI-Generated Summary
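The content-versus-structure ablations can also be pictured concretely. The sketch below shows one plausible way to implement the perturbation families the abstract names; the step delimiter and keyword list are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of the two ablation families from the abstract: "content"
# perturbations (e.g. deleting reflection keywords) versus "structural"
# perturbations (shuffling or deleting whole reasoning steps).
import random

KEYWORDS = ["wait", "however", "alternatively", "verify"]  # illustrative only


def remove_keywords(cot: str) -> str:
    """Content perturbation: strip reflection keywords, keep step order."""
    for kw in KEYWORDS:
        cot = cot.replace(kw, "")
    return cot


def shuffle_steps(cot: str, delim: str = "\n\n") -> str:
    """Structural perturbation: reorder steps, breaking the logical flow."""
    steps = cot.split(delim)
    random.shuffle(steps)
    return delim.join(steps)


def delete_steps(cot: str, frac: float = 0.3, delim: str = "\n\n") -> str:
    """Structural perturbation: drop a random fraction of steps in place."""
    steps = cot.split(delim)
    kept = [s for s in steps if random.random() > frac]
    return delim.join(kept) if kept else cot
```

Per the abstract's findings, training on outputs of `remove_keywords` (or on traces with wrong final answers) barely hurts downstream accuracy, while training on outputs of `shuffle_steps` or `delete_steps` degrades it significantly.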