Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors instead of suppressing them. The vulnerability arises from two core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, which gives the model a way to influence them, and (2) pairwise comparisons indicate only which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM produces its biased responses at higher quality than its unbiased ones, annotators will prefer the biased responses because of their quality. Preference labels, however, do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can then amplify the misaligned bias. Our experiments demonstrate amplification across diverse biases, from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging: existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal a structural vulnerability of current RLHF and underscore the need for defenses against it. Project page: https://alignment-tampering.github.io/
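The mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' experimental setup: the scalar quality score, the linear Bradley-Terry reward model, the noisy quality proxy, and every constant and function name are assumptions chosen only for illustration. A policy attaches a bias to some responses and writes those at higher quality, annotators label pairs by quality alone, and a reward model fit to those labels ends up rewarding the bias itself, so best-of-N sampling selects biased responses more often than the policy produces them.

```python
import math
import random

# All constants are illustrative assumptions, not values from the paper.
BIAS_RATE = 0.3      # fraction of policy outputs that carry the bias
QUALITY_BONUS = 1.5  # extra quality the policy gives its biased outputs
PROXY_NOISE = 2.0    # noise in the reward model's view of quality


def sample_response(rng):
    """Return (true_quality, noisy_quality_proxy, biased) for one toy response."""
    biased = 1.0 if rng.random() < BIAS_RATE else 0.0
    quality = QUALITY_BONUS * biased + rng.gauss(0.0, 1.0)
    proxy = quality + rng.gauss(0.0, PROXY_NOISE)
    return quality, proxy, biased


def build_preferences(rng, n_pairs):
    """Annotators pick the higher-quality response; the label records no reason."""
    data = []
    for _ in range(n_pairs):
        a, b = sample_response(rng), sample_response(rng)
        label = 1.0 if a[0] > b[0] else 0.0
        # The reward model sees only feature differences and the label.
        data.append((a[1] - b[1], a[2] - b[2], label))
    return data


def fit_reward_model(data, steps=300, lr=0.1):
    """Fit r = w_proxy * proxy + w_bias * biased with a Bradley-Terry likelihood."""
    w_proxy, w_bias = 0.0, 0.0
    for _ in range(steps):
        g_proxy = g_bias = 0.0
        for d_proxy, d_bias, label in data:
            p = 1.0 / (1.0 + math.exp(-(w_proxy * d_proxy + w_bias * d_bias)))
            g_proxy += (label - p) * d_proxy
            g_bias += (label - p) * d_bias
        # Full-batch gradient ascent on the average log-likelihood.
        w_proxy += lr * g_proxy / len(data)
        w_bias += lr * g_bias / len(data)
    return w_proxy, w_bias


def best_of_n_bias_rate(rng, w_proxy, w_bias, n, trials):
    """Fraction of best-of-n winners (under the learned reward) that are biased."""
    hits = 0.0
    for _ in range(trials):
        candidates = [sample_response(rng) for _ in range(n)]
        chosen = max(candidates, key=lambda r: w_proxy * r[1] + w_bias * r[2])
        hits += chosen[2]
    return hits / trials


rng = random.Random(0)
w_proxy, w_bias = fit_reward_model(build_preferences(rng, 2000))
best_rate = best_of_n_bias_rate(rng, w_proxy, w_bias, n=8, trials=2000)
print(f"learned weight on bias feature: {w_bias:.2f}")
print(f"bias rate: policy {BIAS_RATE:.2f}, best-of-8 {best_rate:.2f}")
```

In this toy setting the bias feature receives a positive weight even though annotators never looked at it, because it is the cleanest available predictor of the quality they did judge. If the reward model could observe quality without noise, the confound would largely disappear, which is why the noisy proxy is a load-bearing assumption of the sketch.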