Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. It arises from two core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing the model to influence them, and (2) pairwise comparisons indicate only which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM writes its biased responses at higher quality, annotators will prefer them because of that quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and underscore the need for defenses against them. Project page: https://alignment-tampering.github.io/
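
The mechanism described above (quality-based labels, a reward model that cannot separate quality from bias, and best-of-N selection) can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the setup used in the paper's experiments: the constants `P_BIASED`, `QUALITY_GAP`, and `NOISE`, the two-feature Bradley-Terry reward model, and all function names are hypothetical choices made for illustration. Annotators here compare responses on quality alone, yet the fitted reward model may still assign a positive weight to the bias indicator, because its view of quality is noisy and bias is correlated with quality.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
P_BIASED = 0.3     # base rate at which the toy "LLM" emits a biased response
QUALITY_GAP = 1.0  # biased responses are written with higher mean quality
NOISE = 1.0        # noise in the reward model's view of quality


def sample_response(rng):
    """Draw one toy response: a bias flag, its true quality, and a noisy quality proxy."""
    biased = 1.0 if rng.random() < P_BIASED else 0.0
    quality = rng.gauss(QUALITY_GAP * biased, 1.0)
    proxy = quality + rng.gauss(0.0, NOISE)  # what the reward model observes
    return {"biased": biased, "quality": quality, "proxy": proxy}


def build_preferences(rng, n_pairs):
    """Annotators prefer the higher-quality response; the label does not say why."""
    pairs = []
    for _ in range(n_pairs):
        a, b = sample_response(rng), sample_response(rng)
        pairs.append((a, b) if a["quality"] >= b["quality"] else (b, a))
    return pairs  # list of (chosen, rejected)


def fit_reward(pairs, steps=200, lr=0.5):
    """Fit r(x) = w_proxy * proxy + w_bias * biased by gradient ascent
    on the Bradley-Terry log-likelihood of the preference pairs."""
    w_proxy, w_bias = 0.0, 0.0
    for _ in range(steps):
        g_proxy = g_bias = 0.0
        for chosen, rejected in pairs:
            d_proxy = chosen["proxy"] - rejected["proxy"]
            d_bias = chosen["biased"] - rejected["biased"]
            margin = w_proxy * d_proxy + w_bias * d_bias
            slack = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            g_proxy += d_proxy * slack
            g_bias += d_bias * slack
        w_proxy += lr * g_proxy / len(pairs)
        w_bias += lr * g_bias / len(pairs)
    return w_proxy, w_bias


def best_of_n_bias_rate(rng, weights, n, trials=2000):
    """Fraction of best-of-n selections (under the learned reward) that are biased."""
    w_proxy, w_bias = weights
    hits = 0.0
    for _ in range(trials):
        candidates = [sample_response(rng) for _ in range(n)]
        best = max(candidates, key=lambda r: w_proxy * r["proxy"] + w_bias * r["biased"])
        hits += best["biased"]
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    weights = fit_reward(build_preferences(rng, 2000))
    print("reward weights (proxy, bias):", weights)
    for n in (1, 4, 16):
        print(f"best-of-{n} biased fraction:", best_of_n_bias_rate(rng, weights, n))
```

In this toy setting, comparing the biased fraction at `n = 1` (the base rate) with larger `n` gives a rough sense of how selection pressure on the learned reward can raise the share of biased outputs. The simulation says nothing about the magnitude of the effect in real RLHF pipelines.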