A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Abstract

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of the code they generate. Existing benchmarks are inadequate for this purpose: they focus on isolated code snippets, rely on unstable evaluation methods that lack reproducibility, and fail to connect the quality of the input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.

