Humanline: Online Alignment as Perceptual Loss
September 29, 2025
Authors: Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh
cs.AI
Abstract
Online alignment (e.g., GRPO) is generally more performant than offline
alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral
economics, we propose a human-centric explanation. We prove that online
on-policy sampling better approximates the human-perceived distribution of what
the model can produce, and that PPO/GRPO-style clipping -- originally introduced
merely to stabilize training -- recovers a bias in how humans perceive
probability. In this sense, PPO/GRPO already act as perceptual losses. Our
theory further suggests that the online/offline dichotomy is itself incidental
to maximizing human utility, since we can achieve the same effect by
selectively training on any data in a manner that mimics human perception,
rather than restricting ourselves to online on-policy data. Doing so would
allow us to post-train more quickly, cheaply, and flexibly without sacrificing
performance. To this end, we propose a design pattern that explicitly
incorporates perceptual distortions of probability into objectives like
DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that
these humanline variants, even when trained with offline off-policy data, can
match the performance of their online counterparts on both verifiable and
unverifiable tasks.
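
The analogy drawn above can be made concrete with a small numerical sketch. Assuming the standard Tversky-Kahneman (1992) probability-weighting function from prospect theory and the usual PPO/GRPO clipping rule, the snippet below illustrates the two distortions side by side and sketches one hypothetical way such a distortion could be folded into an offline DPO-style margin. The `humanline_margin` helper and its clipping-based distortion are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) probability weighting: an inverse-S curve that
    overweights small probabilities and underweights large ones."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def clipped_ratio(ratio, eps=0.2):
    """PPO/GRPO-style clipping: the importance ratio pi_theta / pi_old is capped
    to [1 - eps, 1 + eps], so extreme ratios stop contributing extra signal."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def humanline_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, eps=0.2):
    """Hypothetical 'humanline' DPO-style margin (an illustrative assumption, not
    the paper's objective): distort each policy/reference ratio with the same
    clipping used online before taking the usual beta-scaled log-ratio difference."""
    r_w = clipped_ratio(np.exp(logp_w - ref_logp_w), eps)  # chosen response
    r_l = clipped_ratio(np.exp(logp_l - ref_logp_l), eps)  # rejected response
    return beta * (np.log(r_w) - np.log(r_l))

if __name__ == "__main__":
    p = np.linspace(0.05, 0.95, 5)
    print("w(p):   ", np.round(tk_weight(p), 3))
    r = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
    print("clip(r):", clipped_ratio(r))
    # Toy offline preference pair: sequence log-probs under the policy and a reference model.
    print("margin: ", round(float(humanline_margin(-10.0, -12.0, -10.2, -11.0)), 4))
```

Both transformations compress extremes: w(p) flattens near 0 and 1, while clipping caps how far the policy/reference ratio can pull the update, which is roughly the sense in which the abstract describes clipping as a perceptual loss.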