Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
August 15, 2024
Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
cs.AI
Abstract
Language Model (LM) agents for cybersecurity that are capable of autonomously
identifying vulnerabilities and executing exploits have the potential to cause
real-world impact. Policymakers, model providers, and other researchers in the
AI and cybersecurity communities are interested in quantifying the capabilities
of such agents to help mitigate cyberrisk and investigate opportunities for
penetration testing. Toward that end, we introduce Cybench, a framework for
specifying cybersecurity tasks and evaluating agents on those tasks. We include
40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF
competitions, chosen to be recent, meaningful, and spanning a wide range of
difficulties. Each task includes its own description and starter files, and is
initialized in an environment where an agent can execute bash commands and
observe outputs. Since many tasks are beyond the capabilities of existing LM
agents, we introduce subtasks, which break down a task into intermediary steps
for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To
evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7
models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct,
Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without
guidance, we find that agents are able to solve only the easiest complete tasks
that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and
GPT-4o having the highest success rates. Finally, subtasks provide more signal
for measuring performance compared to unguided runs, with models achieving a
3.2% higher success rate on complete tasks with subtask-guidance than without
subtask-guidance. All code and data are publicly available at
https://cybench.github.io
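The abstract describes tasks that are initialized in an environment where an agent executes bash commands and observes the outputs until it recovers a flag, with optional subtasks guiding intermediary steps. The sketch below illustrates such a command-and-observe loop under assumed conventions; the names (run_bash, solve_task, query_model, MAX_ITERATIONS) and the COMMAND/FLAG response format are illustrative assumptions, not the actual Cybench API.

```python
import subprocess
from typing import Callable, Optional

# Minimal sketch of an execute-and-observe agent loop, assuming a hypothetical
# interface; this is not the actual Cybench implementation.

MAX_ITERATIONS = 15  # iteration budget per task (assumed value)


def run_bash(command: str, timeout: int = 120) -> str:
    """Execute a bash command in the task environment and return combined output."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return (result.stdout + result.stderr).strip()


def solve_task(task_description: str, query_model: Callable[[str], str]) -> Optional[str]:
    """Repeatedly ask the model for a bash command, run it, and append the
    observation to the transcript until the model submits a flag or the
    iteration budget is exhausted."""
    transcript = f"Task: {task_description}\n"
    for _ in range(MAX_ITERATIONS):
        # The model is assumed to reply either "COMMAND: <bash>" or "FLAG: <answer>".
        response = query_model(transcript)
        if response.startswith("FLAG:"):
            return response[len("FLAG:"):].strip()
        command = response[len("COMMAND:"):].strip() if response.startswith("COMMAND:") else response.strip()
        observation = run_bash(command)
        transcript += f"\n$ {command}\n{observation}\n"
    return None  # task unsolved within the budget
```

For subtask-guided runs, a loop like this would be invoked once per intermediary question and each answer scored separately, which is one way to realize the more gradated evaluation the abstract describes.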