Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to approximately 5× across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.