AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Abstract

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state of the art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and an analysis of training dynamics confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
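
To make the idea of a learnable SFT-RL balance concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a mixed loss mu * L_SFT + (1 - mu) * L_RL, differentiates a held-out utility through a single inner gradient step to obtain a meta-gradient for mu, and adds a hypothetical entropy correction that raises mu when policy entropy falls below a target. All names (`ControllerConfig`, `inner_step`, `meta_gradient`, `update_mu`, `toy_round`), the form of the entropy term, and the 1-D quadratic losses are illustrative assumptions that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControllerConfig:
    inner_lr: float = 0.1        # step size for the policy parameters
    meta_lr: float = 0.5         # step size for the balance weight mu
    entropy_lr: float = 0.0      # strength of the assumed entropy correction
    target_entropy: float = 1.0  # hypothetical entropy level to stay near
    mu_min: float = 0.05         # keep some SFT signal at all times
    mu_max: float = 0.95         # keep some RL signal at all times


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def inner_step(theta, g_sft, g_rl, mu, lr):
    """One gradient-descent step on the mixed loss mu*L_sft + (1-mu)*L_rl."""
    return [t - lr * (mu * gs + (1.0 - mu) * gr)
            for t, gs, gr in zip(theta, g_sft, g_rl)]


def meta_gradient(g_sft, g_rl, g_util_next, lr):
    """d(utility at theta') / d(mu) through one inner step.

    theta' = theta - lr*(mu*g_sft + (1-mu)*g_rl), so
    d(theta')/d(mu) = -lr*(g_sft - g_rl); the chain rule gives the result.
    """
    diff = [gs - gr for gs, gr in zip(g_sft, g_rl)]
    return -lr * dot(g_util_next, diff)


def update_mu(mu, meta_grad, entropy, cfg):
    """Gradient ascent on utility plus the assumed entropy correction, then clip."""
    mu = mu + cfg.meta_lr * meta_grad
    mu = mu + cfg.entropy_lr * (cfg.target_entropy - entropy)
    return max(cfg.mu_min, min(cfg.mu_max, mu))


def toy_round(theta, mu, cfg, entropy=1.0) -> Tuple[List[float], float, float]:
    """One controller round on 1-D quadratics.

    Toy setup: the SFT loss pulls theta toward 1, the RL loss toward 3,
    and the held-out utility peaks at 3.
    """
    g_sft = [theta[0] - 1.0]                 # grad of 0.5*(theta-1)^2
    g_rl = [theta[0] - 3.0]                  # grad of 0.5*(theta-3)^2
    theta_next = inner_step(theta, g_sft, g_rl, mu, cfg.inner_lr)
    g_util_next = [-(theta_next[0] - 3.0)]   # grad of -0.5*(theta'-3)^2
    mg = meta_gradient(g_sft, g_rl, g_util_next, cfg.inner_lr)
    return theta_next, update_mu(mu, mg, entropy, cfg), mg
```

Calling `toy_round([0.0], 0.5, ControllerConfig())` lowers the SFT weight, because in this toy setup the held-out utility is aligned with the RL target. In a real training loop the gradients would come from an autodiff framework and the utility from a validation set; the one-step lookahead here is only the simplest instance of a forward-looking update.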
