Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Abstract

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyber risk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediary steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks, those that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance than unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask guidance than without it. All code and data are publicly available at https://cybench.github.io
