Jailbreaking in the Haystack
November 5, 2025
Authors: Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan
cs.AI
Abstract
Recent advances in long-context language models (LMs) have enabled
million-token inputs, expanding their capabilities across complex tasks like
computer-use agents. Yet, the safety implications of these extended contexts
remain unclear. To bridge this gap, we introduce NINJA (short for
Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by
appending benign, model-generated content to harmful user goals. Critical to
our method is the observation that the position of harmful goals plays an
important role in safety. Experiments on the standard safety benchmark, HarmBench,
show that NINJA significantly increases attack success rates across
state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral,
and Gemini. Unlike prior jailbreaking methods, our approach is low-resource,
transferable, and less detectable. Moreover, we show that NINJA is
compute-optimal -- under a fixed compute budget, increasing context length can
outperform increasing the number of trials in a best-of-N jailbreak. These
findings reveal that even benign long contexts -- when crafted with careful
goal positioning -- introduce fundamental vulnerabilities in modern LMs.
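
As a rough illustration of the construction the abstract describes, the sketch below assembles a long prompt by placing a user goal at a chosen relative position within benign filler passages. The helper name `build_haystack_prompt`, its parameters, and the use of pre-generated placeholder passages are assumptions for illustration only; in the paper the benign content is model-generated, and this is not the authors' released implementation.

```python
# Minimal sketch (assumed, not the authors' code): build a "needle-in-haystack"
# style prompt by inserting a goal at a chosen relative position inside a long
# run of benign filler passages.

def build_haystack_prompt(goal: str, benign_passages: list[str],
                          goal_position: float = 0.0) -> str:
    """Insert `goal` at a relative position in [0, 1] within the passages.

    goal_position = 0.0 places the goal first and appends all benign
    content after it, matching the placement described in the abstract.
    """
    assert 0.0 <= goal_position <= 1.0
    split = round(goal_position * len(benign_passages))
    parts = benign_passages[:split] + [goal] + benign_passages[split:]
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder benign filler; the paper generates this content with the
    # target model itself rather than using canned text.
    filler = [f"Background passage {i}: ..." for i in range(1000)]
    prompt = build_haystack_prompt("<user goal>", filler, goal_position=0.0)
    print(f"Prompt length: {len(prompt)} characters")
```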