Dynamic Risk Assessment for Offensive Cybersecurity Agents

Abstract

Foundation models are becoming increasingly capable autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline, without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.