AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

August 9, 2025
Authors: Lixuan He, Jie Feng, Yong Li
cs.AI

Abstract

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and analyses of the training dynamics confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
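
As a rough illustration of the mechanism the abstract describes (a learnable SFT-RL mixing weight trained alongside the policy, with policy-entropy regularization), the PyTorch sketch below trains a toy policy on a synthetic task. Everything here is an assumption for illustration: the toy task, the `balance_logit` parameterization, the hyperparameters, and especially the one-step first-order meta-update, which stands in for the paper's meta-gradient controller that differentiates through the policy update to optimize long-term task performance.

```python
# Minimal sketch of an adaptive SFT-RL balance, NOT the authors' implementation.
# The toy policy, environment, and one-step meta-update are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, HIDDEN = 8, 16
policy = torch.nn.Sequential(
    torch.nn.Linear(VOCAB, HIDDEN), torch.nn.Tanh(), torch.nn.Linear(HIDDEN, VOCAB)
)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# The SFT-RL balance itself is a learnable (meta-optimized) parameter.
balance_logit = torch.tensor(0.0, requires_grad=True)
meta_opt = torch.optim.Adam([balance_logit], lr=1e-2)
entropy_coef = 0.01  # policy-entropy regularization for stability

def rollout(batch=32):
    """Toy task: observation is a one-hot token; echoing that token earns reward 1."""
    return F.one_hot(torch.randint(0, VOCAB, (batch,)), VOCAB).float()

def combined_loss(obs, lam):
    logits = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    target = obs.argmax(dim=-1)                 # "expert" action for the imitation term
    sft_loss = F.cross_entropy(logits, target)  # SFT: implicit, path-level reward
    action = dist.sample()
    reward = (action == target).float()         # RL: explicit, outcome-based reward
    rl_loss = -(dist.log_prob(action) * (reward - reward.mean())).mean()
    entropy = dist.entropy().mean()
    loss = lam * sft_loss + (1 - lam) * rl_loss - entropy_coef * entropy
    return loss, reward.mean()

for step in range(200):
    lam = torch.sigmoid(balance_logit)          # keep the mixing weight in (0, 1)

    # Inner update: train the policy under the current SFT-RL mixture.
    loss, _ = combined_loss(rollout(), lam.detach())
    policy_opt.zero_grad(); loss.backward(); policy_opt.step()

    # Outer (meta) update: nudge the balance toward lower post-update combined loss.
    # This is a crude first-order proxy for the paper's long-horizon meta-objective.
    meta_loss, avg_reward = combined_loss(rollout(), lam)
    meta_opt.zero_grad(); meta_loss.backward(); meta_opt.step()

    if step % 50 == 0:
        print(f"step {step:3d}  lambda={lam.item():.2f}  reward={avg_reward.item():.2f}")
```

In the full method as the abstract describes it, the outer objective would be evaluated after the inner policy update (on outcome reward rather than the current batch's loss), so the balance weight is tuned for how well it trains the policy over the long run, not for immediate loss reduction.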