AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Abstract

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process that suffers from catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter and dynamically optimizes it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state of the art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training-dynamics analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
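The idea of treating the SFT-RL balance as a learnable parameter can be illustrated with a toy sketch. This is not the AMFT implementation: the abstract does not specify the objective, so the convex combination `mu * L_sft + (1 - mu) * L_rl`, the one-step meta-gradient, the scalar quadratic losses, and every name below (`amft_style_toy`, `sft_target`, `rl_target`, `val_target`) are assumptions made for illustration. The entropy regularization mentioned in the abstract is omitted.

```python
def clip01(x):
    """Keep the mixing weight inside [0, 1]."""
    return min(1.0, max(0.0, x))


def amft_style_toy(steps=2000, inner_lr=0.1, meta_lr=0.025):
    """Toy one-step meta-gradient update of an SFT/RL mixing weight.

    Hypothetical setup (not from the paper): a scalar "policy" theta,
    an imitation loss (theta - sft_target)^2, an outcome loss
    (theta - rl_target)^2, and a held-out objective
    (theta - val_target)^2 standing in for long-term task performance.
    """
    sft_target, rl_target, val_target = 0.0, 2.0, 1.5
    theta, mu = 0.0, 0.5

    for _ in range(steps):
        # Gradients of the two training signals at the current theta.
        g_sft = 2.0 * (theta - sft_target)
        g_rl = 2.0 * (theta - rl_target)

        # Inner step on the mixed loss mu * L_sft + (1 - mu) * L_rl.
        theta_next = theta - inner_lr * (mu * g_sft + (1.0 - mu) * g_rl)

        # Meta-gradient: differentiate the held-out loss at theta_next
        # with respect to mu through the inner step (chain rule).
        dval_dtheta = 2.0 * (theta_next - val_target)
        dtheta_dmu = -inner_lr * (g_sft - g_rl)
        mu = clip01(mu - meta_lr * dval_dtheta * dtheta_dmu)

        theta = theta_next

    return theta, mu


if __name__ == "__main__":
    theta, mu = amft_style_toy()
    print(f"theta={theta:.4f}, mu={mu:.4f}")
```

In this toy setting the mixed loss is minimized at `2 - 2 * mu`, so the controller settles where that minimizer coincides with the held-out target. In a real LLM setting the same chain rule would be applied through an optimizer step on model parameters, which is considerably more expensive and typically approximated.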
