Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
June 5, 2025
Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
cs.AI
Abstract
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for reasoning models, requiring only a small number of samples and no labeled supervision.
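
The abstract only names the core idea: reward each sampled answer with the model's own confidence (self-agreement), with no labels or external reward model. The sketch below is an illustrative, simplified rendering of how such a confidence-as-reward signal could feed a REINFORCE-style update; it is not the authors' exact objective. The helpers sample_answer and policy_log_prob are hypothetical stand-ins for a real LLM sampling call and a log-probability computation, and answers are compared by exact string match.

# Minimal sketch (assumptions noted above): reward = empirical frequency of
# each sampled final answer, i.e. the model's own agreement with itself.
from collections import Counter
from typing import Callable, List, Tuple

def confidence_rewards(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical LLM sampling call
    num_samples: int = 16,                # 16 samples per question, as in the abstract
) -> List[Tuple[str, float]]:
    """Sample answers and assign each one the empirical frequency of its
    final answer as a label-free reward."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    counts = Counter(answers)
    return [(a, counts[a] / num_samples) for a in answers]

def reinforce_loss(
    rewards: List[Tuple[str, float]],
    policy_log_prob: Callable[[str], float],  # hypothetical log p_theta(answer | question)
) -> float:
    """REINFORCE-style surrogate: weight each sampled answer's log-probability
    by its confidence reward; minimizing this raises confident answers' probability."""
    return -sum(r * policy_log_prob(a) for a, r in rewards)

if __name__ == "__main__":
    # Toy demo with a fake "model" that answers one question stochastically.
    import random
    random.seed(0)
    fake_model = lambda q: random.choice(["42", "42", "42", "41"])
    rewards = confidence_rewards("What is 6 * 7?", fake_model)
    loss = reinforce_loss(rewards, policy_log_prob=lambda a: -1.0)
    print(rewards[:4], round(loss, 3))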