Capability-Based Scaling Laws for LLM Red-Teaming
May 26, 2025
Authors: Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
cs.AI
Abstract
As large language models grow in capability and agency, identifying
vulnerabilities through red-teaming becomes vital for safe deployment. However,
traditional prompt-engineering approaches may prove ineffective once
red-teaming turns into a weak-to-strong problem, where target models surpass
red-teamers in capabilities. To study this shift, we frame red-teaming through
the lens of the capability gap between attacker and target. We evaluate more
than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic
human red-teamers across diverse families, sizes, and capability levels. Three
strong trends emerge: (i) more capable models are better attackers, (ii) attack
success drops sharply once the target's capability exceeds the attacker's, and
(iii) attack success rates correlate with high performance on social science
splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking
scaling law that predicts attack success for a fixed target based on
attacker-target capability gap. These findings suggest that fixed-capability
attackers (e.g., humans) may become ineffective against future models,
increasingly capable open-source models amplify risks for existing systems, and
model providers must accurately measure and control models' persuasive and
manipulative abilities to limit their effectiveness as attackers.
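The abstract states that the scaling law predicts attack success from the attacker-target capability gap, but does not give its functional form here. The sketch below illustrates one plausible instantiation only: an assumed logistic dependence of attack success rate (ASR) on the capability gap, fitted with scipy.optimize.curve_fit. The function names, parameters, and data points are illustrative assumptions, not the paper's actual law or measurements.

```python
# Minimal sketch: fit a jailbreaking "scaling law" relating attack success rate (ASR)
# against a fixed target to the attacker-target capability gap.
# The logistic form and all numbers below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def asr_curve(gap, asr_max, slope, midpoint):
    """Assumed logistic ASR model: saturates at asr_max, rises with capability gap."""
    return asr_max / (1.0 + np.exp(-slope * (gap - midpoint)))

# Hypothetical observations: capability gap (attacker score minus target score,
# e.g. in benchmark points) and the measured ASR for each attacker.
gap = np.array([-30.0, -20.0, -10.0, 0.0, 10.0, 20.0, 30.0])
asr = np.array([0.02, 0.05, 0.15, 0.40, 0.65, 0.78, 0.82])

# Fit the three parameters of the assumed logistic form.
params, _ = curve_fit(asr_curve, gap, asr, p0=[0.8, 0.1, 0.0])

# Predict ASR for a new attacker whose capability trails the target by 15 points.
print(f"Predicted ASR at gap=-15: {asr_curve(-15.0, *params):.2f}")
```

Under this assumed form, ASR collapses once the gap becomes sufficiently negative (target stronger than attacker) and saturates for large positive gaps, mirroring the trends described in the abstract.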