

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, and Challenges

April 15, 2026
Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
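To make the core mechanism concrete, the sketch below is a minimal toy illustration of the abstract's central claim: optimizing an expressive policy against a compressed proxy reward can raise the proxy score while true task quality falls. It is not from the paper; the reward functions, constants, and the best-of-n selection used as a stand-in for optimization pressure are all hypothetical assumptions chosen only to make the effect visible.

```python
# Toy sketch of reward hacking against a compressed proxy (illustrative only).
# Assumptions: each candidate response splits its "effort" between genuine task
# quality and a superficial feature (e.g., confident, verbose phrasing) that the
# learned reward model overvalues because the proxy compressed the true objective.
# Best-of-n selection plays the role of optimization pressure.

import random

random.seed(0)

def sample_candidate():
    """Sample a candidate response as (skill, hack_fraction)."""
    skill = random.random()   # latent ability to actually solve the task
    hack = random.random()    # fraction of effort spent gaming the proxy
    return skill, hack

def true_reward(skill, hack):
    """True task success: only genuine effort counts."""
    return (1.0 - hack) * skill

def proxy_reward(skill, hack):
    """Compressed learned reward: partially tracks quality, overvalues the hack."""
    return (1.0 - hack) * skill + 2.0 * hack

def best_of_n(n):
    """Optimize against the proxy by keeping the best of n sampled candidates."""
    return max((sample_candidate() for _ in range(n)),
               key=lambda c: proxy_reward(*c))

for n in (1, 4, 16, 64, 256):
    picks = [best_of_n(n) for _ in range(2000)]
    avg_proxy = sum(proxy_reward(*p) for p in picks) / len(picks)
    avg_true = sum(true_reward(*p) for p in picks) / len(picks)
    print(f"best-of-{n:<3}  proxy reward = {avg_proxy:.3f}   true reward = {avg_true:.3f}")
```

Running this, the average proxy reward of the selected responses climbs toward its maximum as n grows, while the average true reward of those same responses drops toward zero: a minimal Goodhart-style picture of the compression and amplification dynamics the survey formalizes.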