Grammar-Constrained Decoding Can Jailbreak Large Language Models into Generating Malicious Code

Abstract

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs to generate malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, in that it does not implement the malicious request, and structurally diverse, which makes it difficult to suppress through grammar tightening. At the same time, CodeShield preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines, increasing the attack success rate by more than 30 percentage points on average. CodeShield, in turn, restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
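
To make the mechanism behind GCD concrete, the following is a minimal toy sketch of grammar-constrained greedy decoding. It is not the CodeSpear or CodeShield implementation. The vocabulary, the scoring function `toy_logits`, and the five-state grammar are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea that tokens outside the grammar are masked before selection, so a token the model would otherwise prefer (here, a natural-language token) can become unavailable.

```python
import math

# Hypothetical grammar for the single statement `print ( 1 )`, written as a
# deterministic automaton: state -> {allowed token: next state}.
GRAMMAR = {
    0: {"print": 1},
    1: {"(": 2},
    2: {"1": 3},
    3: {")": 4},
    4: {"<eos>": 5},
}


def toy_logits(prefix):
    """Stand-in for a model (assumption): it prefers the natural-language
    token "Sorry" at the first step and "<eos>" afterwards."""
    scores = {"Sorry": 0.0, "print": 2.0, "(": 1.0, "1": 1.0, ")": 1.0, "<eos>": 0.5}
    if not prefix:
        scores["Sorry"] = 5.0
    else:
        scores["<eos>"] = 5.0
    return scores


def decode(grammar=None, max_steps=10):
    """Greedy decoding; if a grammar is given, mask every token it disallows."""
    prefix, state = [], 0
    for _ in range(max_steps):
        scores = toy_logits(prefix)
        if grammar is not None:
            allowed = grammar[state]
            # Disallowed tokens get -inf so they can never be selected.
            scores = {t: (s if t in allowed else -math.inf) for t, s in scores.items()}
        token = max(scores, key=scores.get)
        prefix.append(token)
        if grammar is not None:
            state = grammar[state][token]
        if token == "<eos>":
            break
    return prefix


print(decode())         # unconstrained: the preferred token is chosen
print(decode(GRAMMAR))  # constrained: only grammar-valid tokens are reachable
```

In this toy setting the unconstrained run starts with the natural-language token, while the constrained run can only emit tokens the grammar allows. Real GCD systems apply the same masking idea over a full tokenizer vocabulary and a context-free grammar rather than a hand-written automaton.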