MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Abstract

As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving their performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a preference-learning framework that explicitly incorporates prior reward knowledge into the optimization objective. Whereas Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants but also improves alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that consistently improves DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
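
For orientation, the MLE-versus-MaP contrast drawn in the abstract can be written out in generic form. The block below is only a sketch: it shows the standard DPO objective and the textbook definition of a maximum a posteriori estimate. The symbols (policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$, preferred and dispreferred responses $y_w$ and $y_l$, preference dataset $\mathcal{D}$, logistic function $\sigma$) are notational assumptions and do not appear in the abstract. How MaPPO builds its prior term from reward estimates is not stated in the abstract and is not reproduced here.

```latex
% Standard DPO objective: a maximum-likelihood fit to pairwise preferences
% (notation assumed, not taken from the abstract).
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% Generic estimators: MLE uses the likelihood alone,
% while a MaP estimate adds a log-prior term.
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \log p(\mathcal{D} \mid \theta),
\qquad
\hat{\theta}_{\mathrm{MaP}} = \arg\max_{\theta} \;
  \bigl[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \bigr]
```

With a flat prior the log-prior term is constant and the MaP estimate coincides with the MLE. This is the general sense in which a MaP formulation can contain an MLE formulation as a special case.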
