Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
February 7, 2026
Authors: Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
cs.AI
Abstract
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality–security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality-grounded vulnerability-inducing coding tasks for online RL rollouts, and (ii) training a reasoning-based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state-of-the-art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%. We release our code, dataset, and model checkpoints at https://github.com/AndrewWTY/SecCoderX.
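To make the training loop described above concrete, the following is a minimal, self-contained sketch of how an online RL loop that combines a vulnerability reward with a functionality signal might be organized. All names here (VulnerabilityRewardModel, generate_code, functionality_score, policy_update, online_rl_loop) and the toy heuristics are hypothetical illustrations under stated assumptions, not the authors' implementation; the actual system uses a reasoning-based LLM reward model and RL updates on a code LLM.

```python
# Hypothetical sketch of an online RL loop with a vulnerability reward model.
# Every class/function here is a placeholder stand-in, not the SecCoderX code.
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    task: str
    code: str
    reward: float


class VulnerabilityRewardModel:
    """Stand-in for the reasoning-based vulnerability reward model.

    The real model is an LLM judge; here a trivial heuristic flags string
    concatenation into a SQL query (a CWE-89 pattern) as insecure.
    """

    def score(self, task: str, code: str) -> float:
        return 0.0 if '" + user_id' in code else 1.0


def generate_code(policy: dict, task: str) -> str:
    """Stand-in for sampling a completion from the current policy LLM."""
    candidates = [
        'cursor.execute("SELECT * FROM users WHERE id = " + user_id)',    # injection-prone
        'cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))',  # parameterized
    ]
    return random.choice(candidates)


def functionality_score(task: str, code: str) -> float:
    """Stand-in for a utility signal (e.g., unit-test pass rate)."""
    return 1.0


def policy_update(policy: dict, rollouts: list[Rollout]) -> dict:
    """Stand-in for a policy-gradient step on the collected rollouts."""
    policy["mean_reward"] = sum(r.reward for r in rollouts) / len(rollouts)
    return policy


def online_rl_loop(tasks: list[str], steps: int = 3) -> dict:
    vrm = VulnerabilityRewardModel()
    policy = {"mean_reward": 0.0}
    for _ in range(steps):
        rollouts = []
        for task in tasks:
            code = generate_code(policy, task)
            # Combine security supervision with a functionality signal so that
            # security gains do not come at the cost of utility.
            reward = vrm.score(task, code) * functionality_score(task, code)
            rollouts.append(Rollout(task, code, reward))
        policy = policy_update(policy, rollouts)
    return policy


if __name__ == "__main__":
    vulnerability_inducing_tasks = ["Look up a user record by a user-supplied id."]
    print(online_rl_loop(vulnerability_inducing_tasks))
```

The multiplicative combination of the security and functionality signals is one illustrative way to penalize rollouts that are secure but non-functional (or vice versa); the paper's actual reward design may differ.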