Grammar-Constrained Decoding Can Jailbreak Large Language Models into Generating Malicious Code

Abstract


Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs to generate malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
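
To make the mechanism concrete, the following is a minimal sketch of how grammar-constrained decoding masks tokens, and why a natural-language refusal can become unreachable under a code-only grammar. It is an illustration under stated assumptions, not the paper's implementation: the six-token vocabulary, the finite-state grammar, and the names `GRAMMAR`, `mask_logits`, `constrained_greedy_decode`, and `refusing_model` are all hypothetical, and the "model" is a fixed logit vector standing in for an LLM.

```python
import math

# Hypothetical toy vocabulary: one natural-language refusal token plus a few code tokens.
VOCAB = ["Sorry", "x", "=", "1", "\n", "<eos>"]

# Hypothetical grammar for `NAME = NUMBER NEWLINE`, written as a finite-state
# machine: state -> {allowed token: next state}. No state admits "Sorry".
GRAMMAR = {
    "start": {"x": "name"},
    "name": {"=": "eq"},
    "eq": {"1": "num"},
    "num": {"\n": "done"},
    "done": {"<eos>": "end"},
}


def mask_logits(logits, state):
    """Set the logit of every token the grammar forbids in `state` to -inf."""
    allowed = GRAMMAR[state]
    return [
        logit if token in allowed else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


def constrained_greedy_decode(logits_fn, max_steps=10):
    """Greedy decoding in which only grammar-valid tokens can be selected."""
    state, output = "start", []
    for _ in range(max_steps):
        if state == "end":
            break
        masked = mask_logits(logits_fn(output), state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        token = VOCAB[best]
        output.append(token)
        state = GRAMMAR[state][token]
    return output


def refusing_model(prefix):
    """Stand-in for an LLM that prefers the refusal token at every step."""
    return [5.0, 1.0, 0.5, 0.2, 0.1, 0.0]


# Without constraints, the top-scoring first token is the refusal.
first_logits = refusing_model([])
unconstrained_first = VOCAB[max(range(len(VOCAB)), key=lambda i: first_logits[i])]

# With the code grammar applied, the refusal token is masked at every step.
constrained = constrained_greedy_decode(refusing_model)
```

In this toy setting the unconstrained model would begin with the refusal token, while the constrained decoder can only emit tokens the code grammar accepts. This is consistent with the abstract's point that safety behavior expressed in natural language may not carry over once decoding is restricted to code.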