From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation

Abstract

Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in generalization and in susceptibility to catastrophic forgetting, compared to Reinforcement Learning techniques such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po.
