Dynamic Risk Assessments for Offensive Cybersecurity Agents

Abstract

Foundation models are becoming increasingly capable autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline, without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.