From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
October 8, 2025
Authors: Abdelhakim Benechehab, Gabriel Singer, Corentin Léger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
cs.AI
Abstract
Generative models form the backbone of modern machine learning, underpinning
state-of-the-art systems in text, vision, and multimodal applications. While
Maximum Likelihood Estimation has traditionally served as the dominant training
paradigm, recent work has highlighted its limitations, particularly in
generalization and susceptibility to catastrophic forgetting compared to
Reinforcement Learning techniques, such as Policy Gradient methods. However,
these approaches depend on explicit reward signals, which are often unavailable
in practice, leaving open the fundamental problem of how to align generative
models when only high-quality datasets are accessible. In this work, we address
this challenge via a Bilevel Optimization framework, where the reward function
is treated as the optimization variable of an outer-level problem, while a
policy gradient objective defines the inner level. We then conduct a
theoretical analysis of this optimization problem in a tractable setting and
extract insights that, as we demonstrate, generalize to applications such as
tabular classification and model-based reinforcement learning. We release the
code at https://github.com/abenechehab/nll_to_po.
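
As a reading aid, here is a minimal sketch of the bilevel structure described above, in our own notation; the symbols $r_\phi$ (parametrized reward), $\pi_\theta$ (generative policy), and $\mathcal{D}$ (the available high-quality dataset) are assumptions of this sketch, not notation taken from the paper:

$$
\max_{\phi}\; J_{\text{outer}}\!\left(\pi_{\theta^\star(\phi)};\, \mathcal{D}\right)
\quad \text{s.t.} \quad
\theta^\star(\phi) \in \arg\max_{\theta}\; \mathbb{E}_{x \sim \pi_\theta}\!\left[\, r_\phi(x) \,\right],
$$

where the outer level optimizes the reward parameters $\phi$ under some criterion $J_{\text{outer}}$ measuring agreement between the induced policy and the dataset $\mathcal{D}$ (for instance, a data-fit term such as the likelihood of $\mathcal{D}$ under $\pi_{\theta^\star(\phi)}$), and the inner level is a standard policy-gradient objective under the current reward.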