AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

August 9, 2025
Authors: Lixuan He, Jie Feng, Yong Li
cs.AI

Abstract

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state of the art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and analyses of training dynamics confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced via https://github.com/hlxtsyj/AMFT.
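
The abstract describes AMFT's core loop: a single-stage update that mixes an imitation (SFT) loss and an outcome-reward (RL) loss under a learnable balance weight, with that weight tuned by a meta-gradient through a forward-looking evaluation of task performance and stabilized by policy entropy. The sketch below illustrates that mechanism on a toy linear-softmax policy; the model, data, reward function, hyperparameters, and the one-step differentiable lookahead are illustrative assumptions, not the authors' released implementation (see the GitHub repository for that).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, BATCH = 8, 16, 32
INNER_LR, META_LR, BETA = 0.5, 0.05, 0.01        # illustrative hyperparameters

# Toy linear-softmax "policy"; in AMFT proper this would be the LLM being fine-tuned.
W = (0.01 * torch.randn(DIM, VOCAB)).requires_grad_()
alpha = torch.tensor(0.0, requires_grad=True)     # logit of the learnable SFT-RL balance

token_reward = torch.linspace(0.0, 1.0, VOCAB)    # toy outcome-based reward per token

def sample_batch():
    """Random contexts plus 'expert' tokens standing in for SFT demonstrations."""
    x = torch.randn(BATCH, DIM)
    y = torch.randint(VOCAB // 2, VOCAB, (BATCH,))
    return x, y

for step in range(200):
    x, y = sample_batch()
    logits = x @ W
    logp = F.log_softmax(logits, dim=-1)
    probs = logp.exp()

    # Implicit, path-level reward signal: imitation (SFT) loss on expert tokens.
    sft_loss = F.cross_entropy(logits, y)

    # Explicit, outcome-based reward signal: REINFORCE on sampled tokens.
    actions = torch.distributions.Categorical(probs.detach()).sample()
    rewards = token_reward[actions]
    rl_loss = -(rewards * logp[torch.arange(BATCH), actions]).mean()

    # Policy entropy, used as a stability regularizer.
    entropy = -(probs * logp).sum(dim=-1).mean()

    w = torch.sigmoid(alpha)                      # current imitation-exploration balance
    inner_loss = w * sft_loss + (1.0 - w) * rl_loss - BETA * entropy

    # One-step lookahead: what the policy would become under the current balance.
    (gW,) = torch.autograd.grad(inner_loss, W, create_graph=True)
    W_look = W - INNER_LR * gW

    # Meta-objective: expected task reward of the lookahead policy on a held-out batch,
    # differentiated all the way back to the balance weight (the meta-gradient).
    x_meta, _ = sample_batch()
    meta_reward = (F.softmax(x_meta @ W_look, dim=-1) * token_reward).sum(dim=-1).mean()
    (g_alpha,) = torch.autograd.grad(-meta_reward, alpha)

    # Update the balance by meta-gradient, then the policy by the ordinary inner step.
    with torch.no_grad():
        alpha -= META_LR * g_alpha
        W -= INNER_LR * gW
```

In the method as described, the controller operates on an LLM policy and the meta-objective is long-term task performance; the one-step differentiable lookahead above is only the simplest stand-in for that forward-looking signal.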