Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Abstract

Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality-security paradox: they improve security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning (RL) framework for functionality-preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality-grounded vulnerability-inducing coding tasks for online RL rollouts, and (ii) training a reasoning-based vulnerability reward model that provides scalable and reliable security supervision. These components are then unified in an online RL loop that aligns code LLMs to generate code that is both secure and functional. Extensive experiments demonstrate that SecCoderX achieves state-of-the-art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%. We release our code, dataset, and model checkpoints at https://github.com/AndrewWTY/SecCoderX.
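To make the idea of rewarding security without sacrificing functionality concrete, here is a minimal sketch of one way a per-rollout reward could combine the two signals. It is an illustration under stated assumptions, not the SecCoderX implementation: the all-or-nothing reward, the names `Rollout`, `combined_reward`, and `joint_success_rate`, and both toy checkers are hypothetical. The abstract does not define how ESR is computed, so the rate below is only a stand-in for "secure and functional at once".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    task: str  # a vulnerability-inducing coding prompt
    code: str  # the completion sampled from the policy


def combined_reward(
    rollout: Rollout,
    is_functional: Callable[[Rollout], bool],  # hypothetical functionality check
    is_secure: Callable[[Rollout], bool],      # hypothetical stand-in for a reward model verdict
) -> float:
    """Assumed reward: 1.0 only when the code is both functional and secure."""
    return 1.0 if is_functional(rollout) and is_secure(rollout) else 0.0


def joint_success_rate(rewards: List[float]) -> float:
    """Fraction of rollouts that earned the full reward (0.0 for an empty batch)."""
    return sum(rewards) / len(rewards) if rewards else 0.0


# Toy checkers for demonstration only; a real setup would run tests and
# query a trained vulnerability reward model instead.
def toy_functional(rollout: Rollout) -> bool:
    return "return" in rollout.code


def toy_secure(rollout: Rollout) -> bool:
    return "eval(" not in rollout.code


if __name__ == "__main__":
    batch = [
        Rollout("parse an integer", "return int(x)"),   # functional and secure
        Rollout("parse an integer", "return eval(x)"),  # functional but insecure
        Rollout("parse an integer", "pass"),            # secure but not functional
    ]
    rewards = [combined_reward(r, toy_functional, toy_secure) for r in batch]
    print(rewards, joint_success_rate(rewards))
```

Gating the reward on both conditions is one simple way to keep a policy from trading functionality for security, since a trivially "safe" but useless completion earns nothing.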
