Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
June 5, 2025
Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
cs.AI
Abstract
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for reasoning models, requiring only a small number of samples and no labeled supervision.
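The abstract does not spell out how the confidence reward is computed. A minimal sketch, assuming a self-consistency-style proxy in which each of the 16 sampled completions is rewarded by the fraction of samples that agree with its final answer, might look like the following; the function names and the answer-extraction heuristic are illustrative assumptions, not taken from the paper:

```python
from collections import Counter
from typing import List

def extract_final_answer(completion: str) -> str:
    """Illustrative heuristic: treat the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def self_confidence_rewards(completions: List[str]) -> List[float]:
    """Reward each sample by the fraction of samples sharing its final answer.

    This is a label-free confidence proxy; the paper's exact formulation of the
    self-confidence objective may differ.
    """
    answers = [extract_final_answer(c) for c in completions]
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Hypothetical example: 16 samples for one question, 11 of which agree.
samples = ["... so the answer is 42"] * 11 + ["... so the answer is 41"] * 5
print(self_confidence_rewards(samples)[:3])  # [0.6875, 0.6875, 0.6875]
```

Under this reading, the rewards would then weight a standard policy-gradient or weighted log-likelihood update on the sampled completions, with no external labels involved.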