Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD): even when rubric edits pass benchmark validation, they can still produce systematic, directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
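To make the notion of drift concrete, the following minimal sketch shows one way to compare a judge's agreement with a trusted reference before and after a rubric edit, separately on a benchmark split and a target-domain split. It is an illustration only, not the authors' implementation: the names `Pair`, `accuracy`, `drift_report`, and the `tolerance` threshold are hypothetical, and the toy labels are made up for the example rather than taken from the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One pairwise comparison (hypothetical record layout)."""
    domain: str        # "benchmark" or "target"
    reference: str     # trusted preference label, "A" or "B"
    base_judge: str    # judge's pick under the original rubric
    edited_judge: str  # judge's pick under the edited rubric


def accuracy(pairs, domain, field):
    """Fraction of pairs in `domain` where the judge matches the reference."""
    subset = [p for p in pairs if p.domain == domain]
    if not subset:
        raise ValueError(f"no pairs for domain {domain!r}")
    hits = sum(getattr(p, field) == p.reference for p in subset)
    return hits / len(subset)


def drift_report(pairs, tolerance=0.01):
    """Compare accuracy under both rubrics on each split.

    `tolerance` is an assumed validation threshold: the edit "passes" if
    benchmark accuracy drops by no more than this amount.
    """
    bench_base = accuracy(pairs, "benchmark", "base_judge")
    bench_edit = accuracy(pairs, "benchmark", "edited_judge")
    target_base = accuracy(pairs, "target", "base_judge")
    target_edit = accuracy(pairs, "target", "edited_judge")
    return {
        "passes_benchmark": bench_edit >= bench_base - tolerance,
        "benchmark_delta": bench_edit - bench_base,
        "target_delta": target_edit - target_base,
    }


# Toy data, invented for illustration only.
TOY_PAIRS = [
    Pair("benchmark", "A", "A", "A"),
    Pair("benchmark", "B", "B", "B"),
    Pair("benchmark", "A", "A", "B"),  # edited rubric loses this one
    Pair("benchmark", "B", "A", "B"),  # ...but gains this one
    Pair("target", "A", "A", "B"),
    Pair("target", "B", "B", "A"),
    Pair("target", "A", "A", "A"),
    Pair("target", "B", "B", "B"),
]

report = drift_report(TOY_PAIRS)
print(report)
```

In this toy setting the edited rubric leaves benchmark accuracy unchanged, so it would pass an aggregate check, while target-domain accuracy falls by half. That is the pattern the abstract describes: a benchmark-compliant edit whose effect is only visible when the target domain is measured against a fixed reference.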
