Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Abstract

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs to generate malicious code. Our experiments show that simply applying a benign code grammar constraint is enough to jailbreak LLMs effectively. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress by tightening the grammar. At the same time, CodeShield preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.