Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, and Challenges
April 15, 2026
Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
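To make the mechanism the abstract describes concrete, here is a minimal, self-contained sketch, not taken from the survey: every function, constant, and variable name below is hypothetical. It models an expressive policy ascending a compressed proxy reward (a reward model that conflates length with quality), so the proxy score climbs monotonically while the true objective improves at first and then degrades, the verbosity-bias hack in miniature.

```python
# Toy illustration (hypothetical, not the survey's formalism) of the
# overoptimization dynamic: gradient ascent on a compressed proxy
# reward drives the true objective up, then down.

def true_reward(substance: float, verbosity: float) -> float:
    """Genuine task success: substance helps; padding beyond what
    the task needs (here, 5 units) actively hurts."""
    return substance - 0.5 * max(0.0, verbosity - 5.0)

def proxy_reward(substance: float, verbosity: float) -> float:
    """Compressed reward model: in its training data, longer answers
    correlated with quality, so it rewards verbosity unconditionally."""
    return substance + 0.3 * verbosity

def effort(substance: float, verbosity: float) -> float:
    """Shared effort cost: substance is expensive, padding is cheap."""
    return 0.05 * substance**2 + 0.01 * verbosity**2

# Gradient ascent on (proxy_reward - effort), with analytic gradients
# of the objective s + 0.3*v - 0.05*s^2 - 0.01*v^2.
s, v, lr = 0.0, 0.0, 0.5
for step in range(201):
    if step % 50 == 0:
        print(f"step {step:3d}: proxy={proxy_reward(s, v):6.2f}  "
              f"true={true_reward(s, v):6.2f}  (s={s:.2f}, v={v:.2f})")
    s += lr * (1.0 - 0.1 * s)    # d/ds of the proxy-minus-effort objective
    v += lr * (0.3 - 0.02 * v)   # d/dv of the same objective

# Printed trajectory: proxy rises monotonically; true reward peaks
# around the point where verbosity stops being useful, then declines.
```

Running the sketch prints proxy and true reward side by side; their divergence as optimization continues is a toy instance of the "optimization amplification" the PCH framing points to, and the built-in length bias plays the role of objective compression.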