Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

Abstract

Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study examining the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research primarily focuses on evaluating LLMs using C/C++ datasets. It typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this knowledge gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We have compiled a comprehensive dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: Retraining models using downsampled balanced datasets. b) Model perspective: Investigating ensemble learning methods that combine predictions from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advancements in leveraging generative AI to enhance software security practices.
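The two improvement avenues named above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not specify how downsampling or ensembling was carried out, so the function names, the binary label convention (1 = vulnerable, 0 = non-vulnerable), the use of simple majority voting, and the tie-breaking rule are all assumptions made for illustration.

```python
import random


def downsample_balanced(samples, seed=0):
    """Randomly downsample the larger class so both classes have equal size.

    samples: list of (code, label) pairs; label 1 = vulnerable,
    0 = non-vulnerable (hypothetical convention for this sketch).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for code, label in samples:
        by_label[label].append((code, label))
    # Keep as many samples per class as the smaller class contains.
    n = min(len(by_label[0]), len(by_label[1]))
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions):
    """Combine per-model binary predictions by majority vote.

    predictions: list of label lists, one list per model, all the same length.
    Ties are resolved toward 1 (vulnerable); this is an assumed rule.
    """
    voted = []
    for labels in zip(*predictions):
        positives = sum(labels)
        voted.append(1 if 2 * positives >= len(labels) else 0)
    return voted


if __name__ == "__main__":
    data = [("f1", 1), ("f2", 0), ("f3", 0), ("f4", 0)]
    print(len(downsample_balanced(data)))
    print(majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1]]))
```

Resolving ties toward the vulnerable label favors recall over precision, which is one plausible choice for a security setting; other ensemble schemes, such as weighted voting or stacking, would fit the same interface.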
