A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Abstract

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of the code they generate. Existing benchmarks are inadequate for this purpose: they focus on isolated code snippets, rely on unstable evaluation methods that lack reproducibility, and fail to connect the quality of the input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
