Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Abstract

Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on Olympiadbench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.
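The abstract does not say how confidence is measured or how it enters the training objective, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the confidence of a sampled completion is the probability the model assigns to the whole sequence, and that this value is used as a fixed weight on the completion's log-likelihood. The function names and the toy log-probabilities are hypothetical.

```python
import math
from typing import List


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Assumed confidence measure: probability of the whole completion,
    i.e. exp of the summed per-token log-probabilities."""
    return math.exp(sum(token_logprobs))


def self_confidence_rewards(samples: List[List[float]]) -> List[float]:
    """One reward per sampled completion (e.g. 16 per question),
    computed from the model's own probabilities with no labels."""
    return [sequence_confidence(lp) for lp in samples]


def surrogate_loss(samples: List[List[float]]) -> float:
    """Confidence-weighted negative log-likelihood, averaged over samples.
    The reward is treated as a constant weight; in a real training loop it
    would be computed without gradient tracking (an assumption here)."""
    rewards = self_confidence_rewards(samples)
    weighted = [r * sum(lp) for r, lp in zip(rewards, samples)]
    return -sum(weighted) / len(samples)


# Toy example: two sampled completions, given as per-token log-probabilities.
toy_samples = [
    [math.log(0.5)],                   # sequence probability 0.5
    [math.log(0.5), math.log(0.5)],    # sequence probability 0.25
]
toy_rewards = self_confidence_rewards(toy_samples)
toy_loss = surrogate_loss(toy_samples)
```

Under this assumption, completions the model already considers likely receive larger weights, so minimizing the loss sharpens the output distribution toward its most confident answers without any external reward.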
