
Humanline: Online Alignment as Perceptual Loss

September 29, 2025
Authors: Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh
cs.AI

Abstract

Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping -- originally introduced to just stabilize training -- recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
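To make the idea of a "perceptual distortion of probability" concrete, here is a minimal illustrative sketch (not the paper's implementation): it applies a Kahneman-Tversky-style probability weighting function from prospect theory to the policy importance ratio inside a PPO/GRPO-style clipped surrogate. The function names, the weighting parameter `gamma`, and the clip range `eps` are assumptions chosen for illustration.

```python
import torch


def probability_weighting(p: torch.Tensor, gamma: float = 0.61) -> torch.Tensor:
    """Kahneman-Tversky probability weighting from prospect theory:
    overweights small probabilities and underweights large ones
    (gamma = 0.61 is a commonly cited empirical fit)."""
    return p.pow(gamma) / (p.pow(gamma) + (1.0 - p).pow(gamma)).pow(1.0 / gamma)


def humanline_style_loss(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantages: torch.Tensor,
                         eps: float = 0.2) -> torch.Tensor:
    """GRPO/PPO-like surrogate in which probabilities are first passed
    through a perceptual distortion before forming the importance ratio.
    This is a sketch of the general design pattern, not the exact
    humanline objectives proposed in the paper."""
    # Distort the perceived probabilities of the new and old policies.
    p_new = probability_weighting(logp_new.exp())
    p_old = probability_weighting(logp_old.exp())
    ratio = p_new / p_old.clamp_min(1e-8)

    # Standard PPO/GRPO-style clipping, which the paper argues already acts
    # as a perceptual distortion of probability.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because the distortion acts on whatever probabilities are supplied, the same pattern could in principle be applied to offline off-policy data, which is the flexibility the abstract highlights.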