Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. It arises from two core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, which allows the LLM to influence them, and (2) pairwise comparisons indicate only which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them on the basis of quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can then amplify misaligned biases. Our experiments demonstrate amplification across diverse biases, ranging from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging: existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal a structural vulnerability in current RLHF and underscore the need for defenses against it. Project page: https://alignment-tampering.github.io/
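
The mechanism described above can be illustrated with a small toy simulation. This is only a sketch under strong assumptions and is not the experimental setup of the paper: responses are reduced to a hypothetical pair (bias flag, scalar quality), the annotator compares quality only, and the reward model is a one-parameter Bradley-Terry model that sees only the bias flag. All names and numbers (`bias_rate`, `quality_boost`, the sample sizes) are illustrative choices.

```python
import math
import random


def sample_response(rng, bias_rate=0.2, quality_boost=1.0):
    """Toy policy (assumption): biased responses are written with higher quality."""
    biased = rng.random() < bias_rate
    quality = rng.gauss(0.0, 1.0) + (quality_boost if biased else 0.0)
    return biased, quality


def collect_preferences(rng, n_pairs):
    """Toy annotator (assumption): prefers the higher-quality response, ignores bias.

    The stored label only says which response won, not why it won.
    """
    data = []
    for _ in range(n_pairs):
        a, b = sample_response(rng), sample_response(rng)
        winner, loser = (a, b) if a[1] >= b[1] else (b, a)
        data.append((winner[0], loser[0]))  # keep only the bias flags
    return data


def fit_bias_weight(data):
    """Closed-form Bradley-Terry fit for a reward of the form weight * biased.

    Only pairs whose bias flags differ are informative; the maximum-likelihood
    weight is the logit of the biased side's win rate (Laplace-smoothed here).
    """
    mixed = [(w, l) for w, l in data if w != l]
    biased_wins = sum(1 for w, _ in mixed if w)
    p = (biased_wins + 1) / (len(mixed) + 2)
    return math.log(p / (1 - p))


def best_of_n_bias_rate(rng, weight, n, n_prompts):
    """Fraction of biased responses after best-of-N selection under the toy reward."""
    picked_biased = 0
    for _ in range(n_prompts):
        candidates = [sample_response(rng) for _ in range(n)]
        # Highest reward wins; ties are broken at random.
        best = max(candidates, key=lambda c: (weight * c[0], rng.random()))
        picked_biased += best[0]
    return picked_biased / n_prompts


rng = random.Random(0)
base_rate = 0.2  # bias rate of the toy policy before any optimization
weight = fit_bias_weight(collect_preferences(rng, n_pairs=2000))
bon_rate = best_of_n_bias_rate(rng, weight, n=4, n_prompts=2000)
print(f"learned reward weight on bias: {weight:.2f}")
print(f"bias rate: base={base_rate:.2f}, best-of-4={bon_rate:.2f}")
```

In this toy setting the annotator never looks at the bias flag, yet the fitted reward assigns it a positive weight, because bias and quality are confounded in the labels. Best-of-N selection under that reward then returns biased responses more often than the base policy produces them. A reward model that could observe quality directly would behave differently, so the sketch shows the confounding at its most extreme.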