Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
August 15, 2024
作者: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
cs.AI
Abstract
Language Model (LM) agents for cybersecurity that are capable of autonomously
identifying vulnerabilities and executing exploits have the potential to cause
real-world impact. Policymakers, model providers, and other researchers in the
AI and cybersecurity communities are interested in quantifying the capabilities
of such agents to help mitigate cyberrisk and investigate opportunities for
penetration testing. Toward that end, we introduce Cybench, a framework for
specifying cybersecurity tasks and evaluating agents on those tasks. We include
40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF
competitions, chosen to be recent, meaningful, and spanning a wide range of
difficulties. Each task includes its own description and starter files, and is
initialized in an environment where an agent can execute bash commands and
observe outputs. Since many tasks are beyond the capabilities of existing LM
agents, we introduce subtasks, which break down a task into intermediary steps
for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To
evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7
models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct,
Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without
guidance, we find that agents are able to solve only the easiest complete tasks
that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and
GPT-4o having the highest success rates. Finally, subtasks provide more signal
for measuring performance compared to unguided runs, with models achieving a
3.2% higher success rate on complete tasks with subtask guidance than without
subtask guidance. All code and data are publicly available at
https://cybench.github.io.
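The abstract describes an environment in which an agent repeatedly issues bash commands and observes their output until it recovers the CTF flag. As a rough illustration only (not the Cybench implementation), the sketch below shows such an observe-act loop; `query_model`, the iteration cap, and the flag-in-output success check are hypothetical placeholders.

```python
"""Minimal sketch of a bash-executing agent loop, as described in the abstract.
This is NOT the Cybench harness; the model call is a hypothetical stub."""
import subprocess

MAX_ITERATIONS = 15            # assumed step budget per task
COMMAND_TIMEOUT_SECONDS = 60   # guard against hanging commands


def run_bash(command: str) -> str:
    """Execute a bash command and return combined stdout/stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=COMMAND_TIMEOUT_SECONDS,
    )
    return result.stdout + result.stderr


def query_model(transcript: list[str]) -> str:
    """Hypothetical stand-in for a language-model call that proposes the next command."""
    raise NotImplementedError("plug in an actual model API here")


def agent_loop(task_description: str, flag: str) -> bool:
    """Alternate between model-proposed commands and observations until the flag appears."""
    transcript = [f"Task: {task_description}"]
    for _ in range(MAX_ITERATIONS):
        command = query_model(transcript)
        observation = run_bash(command)
        transcript.append(f"$ {command}\n{observation}")
        if flag in observation:  # success criterion: flag string found in output
            return True
    return False
```

In Cybench itself, each task additionally supplies a description and starter files, and 17 of the 40 tasks include subtask questions for more gradated evaluation; the actual harness is available at https://cybench.github.io.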