Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Abstract

Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, which limits accessibility in resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains: AMC23 accuracy rose from 63% to 80% and AIME24 reached 46.7%, surpassing o1-preview, using only 7,000 samples and a $42 training cost, compared with thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into the trade-offs involved and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All resources are available at https://github.com/knoveleng/open-rs.
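For readers unfamiliar with GRPO, the following is a minimal sketch of its core idea as commonly described: several answers are sampled for the same prompt, and each answer's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed. The function name, the binary reward values, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the released code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    Hypothetical helper for illustration; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one math problem:
# reward 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, relative to the group. When all samples in a group score the same, every advantage is zero and that prompt contributes no learning signal.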

