Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from two core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses that are also of higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can then amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities in current RLHF and emphasize the need for defenses against them. Project page: https://alignment-tampering.github.io/
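
The mechanism described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's experimental setup: every name here (`sample_response`, `observe`, `build_preferences`, `fit_bradley_terry`, `best_of_n_bias_rate`) and every constant is hypothetical. It assumes each response has a binary bias flag and a scalar true quality, that a tampering policy makes its biased responses higher in quality, that annotators label pairs by true quality alone, and that the reward model sees the bias flag exactly but quality only through a noisy proxy. A Bradley-Terry reward model fit on such labels may then assign positive weight to the bias flag itself, and best-of-N selection with that reward may favor biased responses even in a pool where bias and quality are unrelated.

```python
import math
import random


def sample_response(rng, tamper):
    """Draw one hypothetical response as (bias, true_quality)."""
    bias = 1.0 if rng.random() < 0.5 else 0.0
    quality = rng.gauss(0.0, 1.0)
    if tamper:
        # Assumption: the tampering policy writes its biased answers better.
        quality += 1.5 * bias
    return bias, quality


def observe(rng, bias, quality):
    """Features the reward model sees: exact bias flag, noisy quality proxy."""
    return (bias, quality + rng.gauss(0.0, 1.0))


def build_preferences(rng, n_pairs, tamper):
    """Annotators compare true quality only; the label never says why."""
    data = []
    for _ in range(n_pairs):
        a = sample_response(rng, tamper)
        b = sample_response(rng, tamper)
        xa, xb = observe(rng, *a), observe(rng, *b)
        diff = (xa[0] - xb[0], xa[1] - xb[1])
        label = 1.0 if a[1] > b[1] else 0.0
        data.append((diff, label))
    return data


def fit_bradley_terry(data, steps=300, lr=0.5):
    """Fit w so that P(a preferred over b) = sigmoid(w . (x_a - x_b))."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for diff, label in data:
            z = w[0] * diff[0] + w[1] * diff[1]
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += (label - p) * diff[0]
            grad[1] += (label - p) * diff[1]
        # Full-batch gradient ascent on the mean log-likelihood.
        w = [w[i] + lr * grad[i] / len(data) for i in range(2)]
    return w


def best_of_n_bias_rate(rng, w, n=8, trials=500):
    """Fraction of best-of-N winners that are biased, on an untampered pool."""
    biased = 0
    for _ in range(trials):
        pool = [observe(rng, *sample_response(rng, tamper=False)) for _ in range(n)]
        winner = max(pool, key=lambda x: w[0] * x[0] + w[1] * x[1])
        biased += int(winner[0] == 1.0)
    return biased / trials


rng = random.Random(0)
w_tampered = fit_bradley_terry(build_preferences(rng, 2000, tamper=True))
w_clean = fit_bradley_terry(build_preferences(rng, 2000, tamper=False))
rate_tampered = best_of_n_bias_rate(rng, w_tampered)
rate_clean = best_of_n_bias_rate(rng, w_clean)
print("bias weight (tampered, clean):", w_tampered[0], w_clean[0])
print("best-of-8 bias rate (tampered, clean):", rate_tampered, rate_clean)
```

In this sketch the first weight of each fitted vector is the reward the model attaches to the bias flag. The exact numbers depend on the seed and on the assumed noise levels, so they should be read as qualitative illustration only. The noisy quality proxy is the modeling choice that makes the confound visible: if the reward model could observe true quality perfectly, it would have no reason to lean on the bias flag.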