Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to approximately 5× across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
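
To make the idea of a risk-compute curve concrete, the following is a minimal sketch, not the released framework. It assumes that for each prompt we have recorded the cumulative FLOPs spent when the attack first succeeded (infinity if it never did), and it reads "risk at a budget" as the fraction of prompts broken within that budget. The abstract does not define the two summary metrics, so `mean_log10_pressure` is a hypothetical stand-in; all function names and numbers here are illustrative assumptions.

```python
import math


def risk_compute_curve(flops_to_success, budgets):
    """Empirical risk at each budget: the fraction of prompts whose attack
    succeeded using at most that many cumulative FLOPs."""
    n = len(flops_to_success)
    return [sum(1 for f in flops_to_success if f <= b) / n for b in budgets]


def mean_log10_pressure(flops_to_success, cap):
    """Hypothetical summary metric (an assumption, not the paper's definition):
    mean log10 FLOPs to first success, with failed attacks censored at `cap`."""
    logs = [math.log10(min(f, cap)) for f in flops_to_success]
    return sum(logs) / len(logs)


# Illustrative numbers only: three successful attacks and one failure.
flops = [1e12, 5e13, 2e15, math.inf]
budgets = [1e12, 1e14, 1e16]

curve = risk_compute_curve(flops, budgets)
pressure = mean_log10_pressure(flops, cap=1e16)

for budget, risk in zip(budgets, curve):
    print(f"budget {budget:.0e} FLOPs: risk {risk:.2f}")
print(f"mean log10 pressure (censored at 1e16): {pressure:.2f}")
```

Under this reading, two attacks with the same ASR at a fixed query budget can have very different curves: a cheap template-based attack would rise at small budgets, while a gradient-based attack would only rise once the budget covers its optimization cost.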