
Humanline: Online Alignment as Perceptual Loss

September 29, 2025
Authors: Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh
cs.AI

Abstract

Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping -- originally introduced to just stabilize training -- recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
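To make the idea of a "perceptual distortion of probability" concrete, here is a minimal illustrative sketch (not the paper's implementation): it applies a Kahneman-Tversky-style probability weighting function from prospect theory to the policy importance ratio inside a PPO/GRPO-style clipped surrogate. The function names, the weighting parameter `gamma`, and the clip range `eps` are assumptions chosen for illustration.

```python
import torch


def probability_weighting(p: torch.Tensor, gamma: float = 0.61) -> torch.Tensor:
    """Kahneman-Tversky probability weighting from prospect theory:
    overweights small probabilities and underweights large ones
    (gamma = 0.61 is a commonly cited empirical fit)."""
    return p.pow(gamma) / (p.pow(gamma) + (1.0 - p).pow(gamma)).pow(1.0 / gamma)


def humanline_style_loss(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantages: torch.Tensor,
                         eps: float = 0.2) -> torch.Tensor:
    """GRPO/PPO-like surrogate in which probabilities are first passed
    through a perceptual distortion before forming the importance ratio.
    This is a sketch of the general design pattern, not the exact
    humanline objectives proposed in the paper."""
    # Distort the perceived probabilities of the new and old policies.
    p_new = probability_weighting(logp_new.exp())
    p_old = probability_weighting(logp_old.exp())
    ratio = p_new / p_old.clamp_min(1e-8)

    # Standard PPO/GRPO-style clipping, which the paper argues already acts
    # as a perceptual distortion of probability.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because the distortion acts on whatever probabilities are supplied, the same pattern could in principle be applied to offline off-policy data, which is the flexibility the abstract highlights.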