SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
June 24, 2025
Authors: Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
cs.AI
Abstract
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to optimize the LLM directly on demonstrations and self-exploration rollouts, rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy on five mathematical reasoning benchmarks, outperforming zero-RL methods by 9.0%, and outperforms them by 10.9% on three out-of-distribution benchmarks.
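
The abstract describes SRFT only at a high level, so the following is a minimal PyTorch sketch of the general idea rather than the paper's actual algorithm: a single training step that combines an SFT cross-entropy loss on demonstration tokens with a token-level policy-gradient loss on self-exploration rollouts, scaling the SFT term by an entropy-dependent weight. The function name srft_style_loss, the tensor layout, and the exp(-entropy) weight schedule are illustrative assumptions, not the paper's formulation.

import torch
import torch.nn.functional as F

def srft_style_loss(policy_logits_demo, demo_tokens, demo_mask,
                    policy_logits_roll, roll_tokens, roll_advantages, roll_mask):
    """Illustrative single-stage loss: SFT on demonstrations plus a
    policy-gradient term on self-exploration rollouts, with an
    entropy-aware weight on the SFT term (placeholder schedule)."""
    # SFT term: token-level cross-entropy on demonstration tokens.
    # logits are (batch, seq, vocab); cross_entropy expects (batch, vocab, seq).
    sft_tok = F.cross_entropy(policy_logits_demo.transpose(1, 2),
                              demo_tokens, reduction="none")
    sft_loss = (sft_tok * demo_mask).sum() / demo_mask.sum()

    # RL term: simple token-level policy gradient on rollout tokens.
    log_probs = F.log_softmax(policy_logits_roll, dim=-1)
    token_logp = log_probs.gather(-1, roll_tokens.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(token_logp * roll_advantages * roll_mask).sum() / roll_mask.sum()

    # Mean policy entropy over rollout tokens.
    entropy = -(log_probs.exp() * log_probs).sum(-1)
    mean_entropy = (entropy * roll_mask).sum() / roll_mask.sum()

    # Entropy-aware weighting (assumed schedule): damp the SFT term while
    # the policy entropy is high, let the RL term dominate as entropy falls.
    w_sft = torch.exp(-mean_entropy).detach()
    return w_sft * sft_loss + rl_loss

In a full training loop, each step would mix a batch of demonstrations with freshly sampled rollouts and backpropagate this combined loss through the same policy, which is what makes the procedure single-stage rather than a sequential SFT-then-RL pipeline.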