Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, in which models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling the true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, and benchmark overfitting, and, in multimodal settings, as perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of three factors: objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across the RLHF, RLAIF, and RLVR regimes and explains how local shortcut learning can escalate into those broader forms of misalignment. We further organize detection and mitigation strategies according to whether they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
