Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

Abstract

Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, achieving higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although the two are often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.
