From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
October 8, 2025
Authors: Abdelhakim Benechehab, Gabriel Singer, Corentin Léger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
cs.AI
Abstract
Generative models form the backbone of modern machine learning, underpinning
state-of-the-art systems in text, vision, and multimodal applications. While
Maximum Likelihood Estimation has traditionally served as the dominant training
paradigm, recent work has highlighted its limitations, particularly in
generalization and susceptibility to catastrophic forgetting compared to
Reinforcement Learning techniques, such as Policy Gradient methods. However,
these approaches depend on explicit reward signals, which are often unavailable
in practice, leaving open the fundamental problem of how to align generative
models when only high-quality datasets are accessible. In this work, we address
this challenge via a Bilevel Optimization framework, where the reward function
is treated as the optimization variable of an outer-level problem, while a
policy gradient objective defines the inner-level problem. We then conduct a
theoretical analysis of this optimization problem in a tractable setting and
extract insights that, as we demonstrate, generalize to applications such as
tabular classification and model-based reinforcement learning. We release the
code at https://github.com/abenechehab/nll_to_po.
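As a rough formalization of the setup described above (a sketch inferred only from this abstract; the paper's exact objectives, regularizers, and notation may differ), the bilevel problem can be written with the reward parameters φ as the outer variable and the policy parameters θ as the inner variable, where 𝒟 denotes the available high-quality dataset:

```latex
% Hypothetical formalization inferred from the abstract, not the paper's exact statement.
% Outer level: choose reward parameters \phi so that the policy obtained by
% optimizing that reward assigns high likelihood to the dataset \mathcal{D}.
% Inner level: a standard policy-gradient objective on the learned reward r_\phi.
\begin{aligned}
\min_{\phi} \quad & -\,\mathbb{E}_{x \sim \mathcal{D}}\big[\log \pi_{\theta^{*}(\phi)}(x)\big] \\
\text{s.t.} \quad & \theta^{*}(\phi) \in \arg\max_{\theta}\; \mathbb{E}_{x \sim \pi_{\theta}}\big[r_{\phi}(x)\big].
\end{aligned}
```

Under this reading, the outer objective is the usual maximum-likelihood loss on the data, but it is evaluated at the policy produced by solving the inner policy-gradient problem, which is what ties the learned reward back to the dataset.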