

Jailbreaking in the Haystack

November 5, 2025
Authors: Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan
cs.AI

Abstract

Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of harmful goals in the context plays an important role in safety. Experiments on the standard safety benchmark HarmBench show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal: under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-N jailbreaking. These findings reveal that even benign long contexts, when crafted with careful goal positioning, introduce fundamental vulnerabilities in modern LMs.
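To make the construction concrete, below is a minimal sketch of a NINJA-style prompt, based only on the description in the abstract: a harmful goal is embedded at a chosen position inside a long, benign, model-generated "haystack". The function name, the passage format, and the placement scheme are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a NINJA-style prompt (assumptions, not the authors' code):
# the attack surrounds a harmful goal with long, benign, model-generated text,
# and the goal's position in that context is the key variable the abstract
# highlights as affecting safety behavior.

def build_ninja_prompt(harmful_goal: str,
                       benign_passages: list[str],
                       goal_index: int) -> str:
    """Place `harmful_goal` at `goal_index` among the benign passages."""
    parts = (benign_passages[:goal_index]
             + [harmful_goal]
             + benign_passages[goal_index:])
    return "\n\n".join(parts)

# Example: a long benign haystack with the goal placed near the end.
# The passage strings here are placeholders for model-generated content.
haystack = [f"(benign model-generated passage {i})" for i in range(100)]
prompt = build_ninja_prompt("<harmful user goal>", haystack, goal_index=95)
```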