Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
March 20, 2025
Authors: Quy-Anh Dang, Chris Ngo
cs.AI
Abstract
Enhancing the reasoning capabilities of large language models (LLMs)
typically relies on massive computational resources and extensive datasets,
limiting accessibility for resource-constrained settings. Our study
investigates the potential of reinforcement learning (RL) to improve reasoning
in small LLMs, focusing on a 1.5-billion-parameter model,
DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA
A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy
Optimization (GRPO) algorithm and curating a compact, high-quality mathematical
reasoning dataset, we conducted three experiments to explore model behavior and
performance. Our results demonstrate rapid reasoning gains - e.g., AMC23
accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing
o1-preview - using only 7,000 samples and a $42 training cost, compared to
thousands of dollars for baseline models. However, challenges such as
optimization instability and length constraints emerged with prolonged
training. These findings highlight the efficacy of RL-based fine-tuning for
small LLMs, offering a cost-effective alternative to large-scale approaches. We
release our code and datasets as open-source resources, providing insights into
trade-offs and laying a foundation for scalable, reasoning-capable LLMs in
resource-limited environments. All are available at
https://github.com/knoveleng/open-rs.
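The abstract describes adapting the Group Relative Policy Optimization (GRPO) algorithm to a 1.5B-parameter model on a small mathematical-reasoning dataset. Below is a minimal sketch of the group-relative advantage at the core of GRPO, assuming the standard formulation (per-prompt groups of sampled completions, rewards normalized by the group mean and standard deviation) and a binary answer-correctness reward; the helper name `grpo_advantages` is illustrative and is not taken from the authors' open-rs code.

```python
# Minimal sketch of the group-relative advantage used by GRPO (standard
# formulation assumed; not the authors' open-rs implementation).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within one group of completions sampled for a prompt.

    rewards: shape (group_size,), one scalar reward per completion.
    Returns per-completion advantages (r - group mean) / (group std + eps),
    which weight the policy-gradient update for each completion.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one math problem, rewarded 1.0 when the final
# answer matches the reference and 0.0 otherwise (a common rule-based reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))
```

Completions that solve the problem receive positive advantages and the rest negative, so the policy can be updated without a separate value model, which helps keep the memory footprint within the constrained hardware budget the paper targets.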