Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
March 20, 2025
Authors: Quy-Anh Dang, Chris Ngo
cs.AI
Abstract
Enhancing the reasoning capabilities of large language models (LLMs)
typically relies on massive computational resources and extensive datasets,
limiting accessibility for resource-constrained settings. Our study
investigates the potential of reinforcement learning (RL) to improve reasoning
in small LLMs, focusing on a 1.5-billion-parameter model,
DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA
A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy
Optimization (GRPO) algorithm and curating a compact, high-quality mathematical
reasoning dataset, we conducted three experiments to explore model behavior and
performance. Our results demonstrate rapid reasoning gains - e.g., AMC23
accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing
o1-preview - using only 7,000 samples and a $42 training cost, compared to
thousands of dollars for baseline models. However, challenges such as
optimization instability and length constraints emerged with prolonged
training. These findings highlight the efficacy of RL-based fine-tuning for
small LLMs, offering a cost-effective alternative to large-scale approaches. We
release our code and datasets as open-source resources, providing insights into
trade-offs and laying a foundation for scalable, reasoning-capable LLMs in
resource-limited environments. All are available at
https://github.com/knoveleng/open-rs.
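The abstract describes adapting the Group Relative Policy Optimization (GRPO) algorithm to a 1.5B-parameter model on a small mathematical-reasoning dataset. Below is a minimal sketch of the group-relative advantage at the core of GRPO, assuming the standard formulation (per-prompt groups of sampled completions, rewards normalized by the group mean and standard deviation) and a binary answer-correctness reward; the helper name `grpo_advantages` is illustrative and is not taken from the authors' open-rs code.

```python
# Minimal sketch of the group-relative advantage used by GRPO (standard
# formulation assumed; not the authors' open-rs implementation).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within one group of completions sampled for a prompt.

    rewards: shape (group_size,), one scalar reward per completion.
    Returns per-completion advantages (r - group mean) / (group std + eps),
    which weight the policy-gradient update for each completion.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one math problem, rewarded 1.0 when the final
# answer matches the reference and 0.0 otherwise (a common rule-based reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))
```

Completions that solve the problem receive positive advantages and the rest negative, so the policy can be updated without a separate value model, which helps keep the memory footprint within the constrained hardware budget the paper targets.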