Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
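To make the notion of drift concrete, the following is a minimal sketch of how one might compare a judge's agreement with a fixed reference before and after a rubric edit, separately on a benchmark split and a target-domain split. It is an illustration only, not the authors' implementation: the `Judgment` record, the `"benchmark"`/`"target"` domain labels, and the `drift_report` helper are all hypothetical names introduced here, and the pairwise A/B labelling is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One pairwise comparison (hypothetical record layout)."""
    domain: str     # "benchmark" or "target" (assumed split names)
    reference: str  # preferred response per the fixed human/trusted reference: "A" or "B"
    baseline: str   # judge's choice under the original rubric
    edited: str     # judge's choice under the edited rubric


def accuracy(items, field):
    """Fraction of items where the judge's choice in `field` matches the reference."""
    if not items:
        raise ValueError("no judgments for this domain")
    return sum(getattr(j, field) == j.reference for j in items) / len(items)


def drift_report(judgments, domain):
    """Summarize how a rubric edit changed agreement with the reference on one domain."""
    subset = [j for j in judgments if j.domain == domain]
    base_acc = accuracy(subset, "baseline")
    edit_acc = accuracy(subset, "edited")
    # Items whose verdict changed after the rubric edit.
    flipped = [j for j in subset if j.baseline != j.edited]
    return {
        "baseline_acc": base_acc,
        "edited_acc": edit_acc,
        "acc_drop": base_acc - edit_acc,
        # Directionality: flips that moved away from vs. toward the reference.
        "flips_away": sum(j.baseline == j.reference for j in flipped),
        "flips_toward": sum(j.edited == j.reference for j in flipped),
    }
```

Under these assumptions, an edit that leaves `acc_drop` near zero on the benchmark split while producing a large `acc_drop` and mostly `flips_away` on the target split would match the pattern the abstract describes: benchmark validation passes while target-domain preferences shift in one direction.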
