SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models

December 20, 2025
Author: Scott Thornton
cs.AI

Abstract

AI assistants produce vulnerable code in 45% of security-relevant scenarios, introducing flaws into production systems at scale. Yet existing secure coding datasets fall short. They lack incident grounding, don't provide the scale modern training requires, and miss the operational security context developers need for production deployments. We present SecureCode v2.0, a production-grade dataset of 1,215 security-focused coding examples that passed structural validation and expert security review. Every example ties to actual documented security incidents with CVE references, provides vulnerable and secure implementations, demonstrates concrete attacks, and includes defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (complete OWASP Top 10:2025 plus AI/ML Security Threats) across 11 languages (Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, and YAML for infrastructure-as-code).

Our quality assurance framework ensures complete incident grounding. Each example includes SIEM integration strategies, infrastructure hardening recommendations (Docker, AppArmor, WAF configurations), and testing approaches using language-appropriate frameworks. The dataset uses a 4-turn conversational structure mirroring actual developer-AI interactions, escalating from basic implementations to advanced security considerations and defense-in-depth guidance.

Our contributions: (1) 1,215 rigorously validated examples split into 989 training, 122 validation, and 104 test examples, (2) an automated validation framework ensuring dataset consistency, (3) a 4-turn conversational structure capturing realistic security workflows, (4) comprehensive operational security guidance with SIEM integration strategies, (5) complete language-specific implementation fidelity, and (6) open-source release of data, validation tools, and benchmarking protocols.
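As a rough illustration of the split sizes and the structural checks the abstract mentions, the sketch below confirms that the three splits sum to 1,215 and shows what a structural check on a single 4-turn example might look like. The record layout (`Turn`, the `human`/`assistant` role order) and the function names are hypothetical assumptions made for illustration. The abstract does not describe the dataset's actual schema or the authors' validation framework. Only the split counts, the 4-turn length, and the presence of a CVE reference come from the abstract.

```python
import re
from dataclasses import dataclass

# Split sizes as stated in the abstract.
SPLITS = {"train": 989, "validation": 122, "test": 104}
TOTAL_EXAMPLES = 1215

# Hypothetical role order for a 4-turn conversation (an assumption;
# the abstract gives the number of turns, not the roles).
EXPECTED_ROLES = ("human", "assistant", "human", "assistant")

# Standard CVE identifier shape: CVE-<year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")


@dataclass
class Turn:
    """One conversation turn (hypothetical record layout)."""
    role: str
    content: str


def split_fractions(splits):
    """Return each split's share of the total example count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def structural_errors(turns, cve_id):
    """Return a list of structural problems; an empty list means the record passes."""
    errors = []
    if len(turns) != len(EXPECTED_ROLES):
        errors.append(f"expected {len(EXPECTED_ROLES)} turns, got {len(turns)}")
    else:
        roles = tuple(turn.role for turn in turns)
        if roles != EXPECTED_ROLES:
            errors.append(f"unexpected role order: {roles}")
    if any(not turn.content.strip() for turn in turns):
        errors.append("empty turn content")
    if not CVE_PATTERN.match(cve_id):
        errors.append(f"malformed CVE identifier: {cve_id!r}")
    return errors


if __name__ == "__main__":
    # The three splits should account for every example.
    assert sum(SPLITS.values()) == TOTAL_EXAMPLES
    for name, share in split_fractions(SPLITS).items():
        print(f"{name}: {share:.1%}")
```

A check like this covers only structure. The expert security review and incident grounding described in the abstract are not something a script of this kind can verify.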