Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
August 15, 2024
作者: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
cs.AI
Abstract
Language Model (LM) agents for cybersecurity that are capable of autonomously
identifying vulnerabilities and executing exploits have the potential to cause
real-world impact. Policymakers, model providers, and other researchers in the
AI and cybersecurity communities are interested in quantifying the capabilities
of such agents to help mitigate cyberrisk and investigate opportunities for
penetration testing. Toward that end, we introduce Cybench, a framework for
specifying cybersecurity tasks and evaluating agents on those tasks. We include
40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF
competitions, chosen to be recent, meaningful, and spanning a wide range of
difficulties. Each task includes its own description and starter files, and is
initialized in an environment where an agent can execute bash commands and
observe outputs. Since many tasks are beyond the capabilities of existing LM
agents, we introduce subtasks, which break down a task into intermediary steps
for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To
evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7
models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct,
Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without
guidance, we find that agents are able to solve only the easiest complete tasks
that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and
GPT-4o having the highest success rates. Finally, subtasks provide more signal
for measuring performance compared to unguided runs, with models achieving a
3.2% higher success rate on complete tasks with subtask guidance than without
subtask guidance. All code and data are publicly available at
https://cybench.github.io.
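The abstract describes an environment in which an agent repeatedly issues bash commands and observes their output until it recovers the CTF flag. As a rough illustration only (not the Cybench implementation), the sketch below shows such an observe-act loop; `query_model`, the iteration cap, and the flag-in-output success check are hypothetical placeholders.

```python
"""Minimal sketch of a bash-executing agent loop, as described in the abstract.
This is NOT the Cybench harness; the model call is a hypothetical stub."""
import subprocess

MAX_ITERATIONS = 15            # assumed step budget per task
COMMAND_TIMEOUT_SECONDS = 60   # guard against hanging commands


def run_bash(command: str) -> str:
    """Execute a bash command and return combined stdout/stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=COMMAND_TIMEOUT_SECONDS,
    )
    return result.stdout + result.stderr


def query_model(transcript: list[str]) -> str:
    """Hypothetical stand-in for a language-model call that proposes the next command."""
    raise NotImplementedError("plug in an actual model API here")


def agent_loop(task_description: str, flag: str) -> bool:
    """Alternate between model-proposed commands and observations until the flag appears."""
    transcript = [f"Task: {task_description}"]
    for _ in range(MAX_ITERATIONS):
        command = query_model(transcript)
        observation = run_bash(command)
        transcript.append(f"$ {command}\n{observation}")
        if flag in observation:  # success criterion: flag string found in output
            return True
    return False
```

In Cybench itself, each task additionally supplies a description and starter files, and 17 of the 40 tasks include subtask questions for more gradated evaluation; the actual harness is available at https://cybench.github.io.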