Jailbreaking in the Haystack
November 5, 2025
Authors: Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan
cs.AI
Abstract
Recent advances in long-context language models (LMs) have enabled
million-token inputs, expanding their capabilities across complex tasks like
computer-use agents. Yet, the safety implications of these extended contexts
remain unclear. To bridge this gap, we introduce NINJA (short for
Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by
appending benign, model-generated content to harmful user goals. Critical to
our method is the observation that the position of harmful goals plays an
important role in safety. Experiments on the standard safety benchmark, HarmBench,
show that NINJA significantly increases attack success rates across
state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral,
and Gemini. Unlike prior jailbreaking methods, our approach is low-resource,
transferable, and less detectable. Moreover, we show that NINJA is
compute-optimal -- under a fixed compute budget, increasing context length can
outperform increasing the number of trials in a best-of-N jailbreak. These
findings reveal that even benign long contexts -- when crafted with careful
goal positioning -- introduce fundamental vulnerabilities in modern LMs.
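
As a rough illustration of the construction the abstract describes, the sketch below assembles a long prompt by placing a user goal at a chosen relative position within benign filler passages. The helper name `build_haystack_prompt`, its parameters, and the use of pre-generated placeholder passages are assumptions for illustration only; in the paper the benign content is model-generated, and this is not the authors' released implementation.

```python
# Minimal sketch (assumed, not the authors' code): build a "needle-in-haystack"
# style prompt by inserting a goal at a chosen relative position inside a long
# run of benign filler passages.

def build_haystack_prompt(goal: str, benign_passages: list[str],
                          goal_position: float = 0.0) -> str:
    """Insert `goal` at a relative position in [0, 1] within the passages.

    goal_position = 0.0 places the goal first and appends all benign
    content after it, matching the placement described in the abstract.
    """
    assert 0.0 <= goal_position <= 1.0
    split = round(goal_position * len(benign_passages))
    parts = benign_passages[:split] + [goal] + benign_passages[split:]
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder benign filler; the paper generates this content with the
    # target model itself rather than using canned text.
    filler = [f"Background passage {i}: ..." for i in range(1000)]
    prompt = build_haystack_prompt("<user goal>", filler, goal_position=0.0)
    print(f"Prompt length: {len(prompt)} characters")
```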