

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

March 14, 2026
Authors: Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song
cs.AI

Abstract

Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet attaining higher accuracy and more reliable reasoning on specific tasks or domains generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research on scalable, efficient, and generalizable LLM post-training.
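As context for the contrast the abstract draws, the two post-training objectives are commonly written as follows. This is a standard formulation from the broader post-training literature, not notation taken from this paper: SFT maximizes the likelihood of demonstration data, while RL, in its RLHF-style form, maximizes expected reward under a KL penalty toward a reference policy.

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)\right]$$

$$\mathcal{J}_{\text{RL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\!\left[r(x,y)\right] \;-\; \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\Vert\,\pi_{\text{ref}}(\cdot\mid x)\right)$$

Here x is a prompt, y a response, D the dataset, π_θ the policy being trained, r a (typically learned) reward function, π_ref a frozen reference policy (often the SFT checkpoint), and β the KL penalty weight. The first objective fits the model to fixed target sequences; the second optimizes model-generated sequences against a reward, which is the structural difference the survey's unified view builds on.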