
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

January 31, 2025
作者: Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
cs.AI

Abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
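The pipeline the abstract describes can be sketched in miniature: a natural-language constitution of permitted and restricted content is used to prompt an LLM for labeled synthetic examples, which then train a guard classifier. The sketch below is illustrative only, assuming a stubbed LLM call and a trivial keyword classifier; the constitution rules, prompt template, and all function names are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the Constitutional Classifiers training pipeline:
# constitution rules -> LLM-generated synthetic examples -> guard classifier.
# The LLM call is stubbed so the sketch runs offline; everything here is
# illustrative, not the authors' actual code.

CONSTITUTION = {
    # label 1 = restricted (classifier should block), 0 = permitted
    "restricted": "Detailed instructions for synthesizing illegal substances.",
    "permitted": "General chemistry education at a textbook level.",
}

def llm_generate(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns canned text per class."""
    if "restricted" in prompt:
        return "Step-by-step synthesis route for a controlled precursor..."
    return "An overview of acid-base reactions from an intro course..."

def make_synthetic_dataset(n_per_class: int = 3):
    """Prompt the (stubbed) LLM once per rule per example to build a
    labeled synthetic dataset of (text, label) pairs."""
    data = []
    for label_name, rule in CONSTITUTION.items():
        label = 1 if label_name == "restricted" else 0
        for i in range(n_per_class):
            prompt = (f"Following this rule, write example #{i} of "
                      f"{label_name} content: {rule}")
            data.append((llm_generate(prompt), label))
    return data

def train_classifier(dataset):
    """Trivial bag-of-words stand-in for the trained safeguard model:
    flag any word seen only in restricted examples."""
    restricted_words, permitted_words = set(), set()
    for text, label in dataset:
        (restricted_words if label else permitted_words).update(
            text.lower().split())
    flags = restricted_words - permitted_words
    def classify(text: str) -> int:
        return int(any(w in flags for w in text.lower().split()))
    return classify
```

In deployment the classifier would sit in front of (and behind) the production model, screening user inputs and model outputs, which is consistent with the abstract's reported refusal-rate and inference-overhead costs.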
