Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
March 20, 2025
Authors: Quy-Anh Dang, Chris Ngo
cs.AI
Abstract
Enhancing the reasoning capabilities of large language models (LLMs)
typically relies on massive computational resources and extensive datasets,
limiting accessibility for resource-constrained settings. Our study
investigates the potential of reinforcement learning (RL) to improve reasoning
in small LLMs, focusing on a 1.5-billion-parameter model,
DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA
A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy
Optimization (GRPO) algorithm and curating a compact, high-quality mathematical
reasoning dataset, we conducted three experiments to explore model behavior and
performance. Our results demonstrate rapid reasoning gains - e.g., AMC23
accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing
o1-preview - using only 7,000 samples and a $42 training cost, compared to
thousands of dollars for baseline models. However, challenges such as
optimization instability and length constraints emerged with prolonged
training. These findings highlight the efficacy of RL-based fine-tuning for
small LLMs, offering a cost-effective alternative to large-scale approaches. We
release our code and datasets as open-source resources, providing insights into
trade-offs and laying a foundation for scalable, reasoning-capable LLMs in
resource-limited environments. All are available at
https://github.com/knoveleng/open-rs.
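The abstract's central ingredient, Group Relative Policy Optimization (GRPO), scores each sampled response against the other responses in its group rather than against a learned value baseline. A minimal sketch of that group-relative advantage computation, assuming scalar rewards per sampled response (the function name and `eps` constant are illustrative, not from the paper):

```python
# Sketch of GRPO's group-relative advantage: normalize each response's
# reward by the mean and standard deviation of its sampling group.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """advantage_i = (r_i - mean(group)) / (std(group) + eps)

    rewards: scalar rewards for all responses sampled for one prompt.
    eps guards against a zero std when all rewards in the group tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group of rewards `[1.0, 0.0, 1.0, 0.0]` yields advantages of roughly `[1, -1, 1, -1]`: correct responses are pushed up and incorrect ones down, relative to the group, with no critic network needed.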