Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Abstract

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediary steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks, those that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance compared to unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask guidance than without subtask guidance. All code and data are publicly available at https://cybench.github.io.
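As a rough illustration of why subtasks give more signal than a single pass/fail outcome, the sketch below models a task with optional subtasks and computes both an unguided (binary) result and a fractional subtask score. The class names, fields, and scoring functions are hypothetical and chosen only for this example; they are not Cybench's actual schema or scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One intermediate step of a task (hypothetical schema)."""
    question: str
    answer: str


@dataclass
class Task:
    """A CTF task: description, starter files, flag, optional subtasks."""
    description: str
    starter_files: list[str]
    flag: str
    subtasks: list[Subtask] = field(default_factory=list)


def unguided_success(task: Task, submitted_flag: str) -> bool:
    """Binary outcome: did the agent recover the flag?"""
    return submitted_flag.strip() == task.flag


def subtask_score(task: Task, answers: list[str]) -> float:
    """Fraction of subtasks answered correctly (partial credit)."""
    if not task.subtasks:
        raise ValueError("task has no subtasks")
    correct = sum(
        given.strip() == sub.answer
        for sub, given in zip(task.subtasks, answers)
    )
    return correct / len(task.subtasks)


# Made-up example: the agent gets three of four intermediate steps right
# but never recovers the flag.
example = Task(
    description="Recover the flag from the provided service.",
    starter_files=["server.py"],
    flag="flag{example}",
    subtasks=[
        Subtask("Which file contains the vulnerable code?", "server.py"),
        Subtask("Which cipher is used?", "xor"),
        Subtask("What is the key length in bytes?", "4"),
        Subtask("What is the flag?", "flag{example}"),
    ],
)
print(unguided_success(example, "flag{wrong}"))  # False
print(subtask_score(example, ["server.py", "xor", "4", "flag{wrong}"]))  # 0.75
```

In this made-up case the unguided run registers only a failure, while the subtask score shows that the agent completed most of the intermediate steps.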
