

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

March 14, 2026
Authors: Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song
cs.AI

Abstract

Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, attaining higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.
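For orientation, the two objectives contrasted throughout the study take the following standard forms; the notation below is generic (a minimal sketch, not the paper's own formulation). SFT maximizes the likelihood of demonstration data, while the typical RLHF-style RL objective maximizes a reward under a KL penalty toward a reference policy:

% Standard SFT objective: maximum likelihood over demonstration pairs (x, y) from dataset D
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta\bigl(y_t \mid x, y_{<t}\bigr) \right]

% Standard KL-regularized RL objective: expected reward r(x, y) under the policy,
% penalized by divergence from a reference policy pi_ref with coefficient beta
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\bigl[ r(x,y) \bigr] - \beta\,\mathbb{E}_{x\sim\mathcal{D}}\!\left[ \mathrm{KL}\bigl(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr) \right]

The difference between fitting fixed demonstrations and optimizing a (learned or verifiable) reward over model-generated samples is the source of the two methods' differing data requirements and generalization behavior discussed in the study.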