Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. We evaluate ten models spanning three families and four stages of language model training and alignment, using three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks. We find that: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to approximately 5× across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
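
To make the idea of a risk-compute curve concrete, the following is a minimal sketch under stated assumptions, not the released framework. It assumes that, for each benchmark prompt, we have recorded the cumulative FLOPs an attack spent before its first success (or infinity if it never succeeded), and it takes "attack risk at a budget" to mean the fraction of prompts jailbroken within that budget. The function name `risk_compute_curve` and all numbers are hypothetical; the two summary metrics mentioned above are not defined in this abstract and are not reproduced here.

```python
import math


def risk_compute_curve(flops_to_success, budgets):
    """Empirical risk at each compute budget (illustrative assumption).

    flops_to_success: cumulative FLOPs until first success, one entry per
        prompt; math.inf marks prompts the attack never broke.
    budgets: compute budgets (in FLOPs) at which to evaluate risk.
    Returns the fraction of prompts broken with at most each budget.
    """
    n = len(flops_to_success)
    return [
        sum(1 for f in flops_to_success if f <= b) / n
        for b in budgets
    ]


# Made-up values for five prompts; one prompt is never jailbroken.
flops_to_success = [2e12, 5e13, 5e13, math.inf, 8e14]
budgets = [1e12, 1e13, 1e14, 1e15]

curve = risk_compute_curve(flops_to_success, budgets)
for budget, risk in zip(budgets, curve):
    print(f"budget={budget:.0e} FLOPs -> risk={risk:.1f}")
```

Under this reading, two attacks with the same ASR at a fixed query budget can produce very different curves, because one may reach that success rate at a far smaller FLOPs budget than the other.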