
Capability-Based Scaling Laws for LLM Red-Teaming

May 26, 2025
Authors: Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
cs.AI

Abstract

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.
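The abstract does not state the functional form of the derived scaling law. A common way to model a success probability as a function of a score gap is a logistic curve, so below is a minimal sketch of that idea: attack success rate (ASR) against a fixed target as a sigmoid of the attacker-target capability gap. The sigmoid form, the `asr_model` function, the parameters `alpha` and `beta`, and all data points are illustrative assumptions, not the paper's actual law or measurements.

```python
# Minimal sketch of a "jailbreaking scaling law" fit: ASR against a fixed
# target modeled as a logistic function of the attacker-target capability gap.
# Functional form, parameter names, and data are illustrative assumptions,
# not the paper's actual law or measurements.
import numpy as np
from scipy.optimize import curve_fit

def asr_model(gap, alpha, beta):
    """Hypothetical scaling law: ASR = sigmoid(alpha * gap + beta)."""
    return 1.0 / (1.0 + np.exp(-(alpha * gap + beta)))

# Toy data: capability gap (attacker score minus target score, e.g. on a
# benchmark such as MMLU-Pro) and the observed attack success rate.
gaps = np.array([-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3])
asr  = np.array([0.05, 0.10, 0.22, 0.45, 0.68, 0.85, 0.93])

(alpha, beta), _ = curve_fit(asr_model, gaps, asr, p0=[1.0, 0.0])
print(f"fitted alpha={alpha:.2f}, beta={beta:.2f}")

# Extrapolation: predicted ASR for an attacker weaker than the target
# (gap < 0) drops off sharply, matching trend (ii) in the abstract.
print(f"predicted ASR at gap=-0.4: {asr_model(-0.4, alpha, beta):.2f}")
```

Under this toy fit, a fixed-capability attacker sees its predicted success rate fall toward zero as targets improve and the gap grows more negative, which is the qualitative claim the abstract makes about human red-teamers and future models.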
