Capability-Based Scaling Laws for LLM Red-Teaming

Abstract

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

