Humanline: Online Alignment as Perceptual Loss
September 29, 2025
Authors: Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh
cs.AI
Abstract
Online alignment (e.g., GRPO) is generally more performant than offline
alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral
economics, we propose a human-centric explanation. We prove that online
on-policy sampling better approximates the human-perceived distribution of what
the model can produce, and that PPO/GRPO-style clipping -- originally introduced
merely to stabilize training -- recovers a bias in how humans perceive
probability. In this sense, PPO/GRPO already act as perceptual losses. Our
theory further suggests that the online/offline dichotomy is itself incidental
to maximizing human utility, since we can achieve the same effect by
selectively training on any data in a manner that mimics human perception,
rather than restricting ourselves to online on-policy data. Doing so would
allow us to post-train more quickly, cheaply, and flexibly without sacrificing
performance. To this end, we propose a design pattern that explicitly
incorporates perceptual distortions of probability into objectives like
DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that
these humanline variants, even when trained with offline off-policy data, can
match the performance of their online counterparts on both verifiable and
unverifiable tasks.
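
The analogy drawn above can be made concrete with a small numerical sketch. Assuming the standard Tversky-Kahneman (1992) probability-weighting function from prospect theory and the usual PPO/GRPO clipping rule, the snippet below illustrates the two distortions side by side and sketches one hypothetical way such a distortion could be folded into an offline DPO-style margin. The `humanline_margin` helper and its clipping-based distortion are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) probability weighting: an inverse-S curve that
    overweights small probabilities and underweights large ones."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def clipped_ratio(ratio, eps=0.2):
    """PPO/GRPO-style clipping: the importance ratio pi_theta / pi_old is capped
    to [1 - eps, 1 + eps], so extreme ratios stop contributing extra signal."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def humanline_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, eps=0.2):
    """Hypothetical 'humanline' DPO-style margin (an illustrative assumption, not
    the paper's objective): distort each policy/reference ratio with the same
    clipping used online before taking the usual beta-scaled log-ratio difference."""
    r_w = clipped_ratio(np.exp(logp_w - ref_logp_w), eps)  # chosen response
    r_l = clipped_ratio(np.exp(logp_l - ref_logp_l), eps)  # rejected response
    return beta * (np.log(r_w) - np.log(r_l))

if __name__ == "__main__":
    p = np.linspace(0.05, 0.95, 5)
    print("w(p):   ", np.round(tk_weight(p), 3))
    r = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
    print("clip(r):", clipped_ratio(r))
    # Toy offline preference pair: sequence log-probs under the policy and a reference model.
    print("margin: ", round(float(humanline_margin(-10.0, -12.0, -10.2, -11.0)), 4))
```

Both transformations compress extremes: w(p) flattens near 0 and 1, while clipping caps how far the policy/reference ratio can pull the update, which is roughly the sense in which the abstract describes clipping as a perceptual loss.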