Dynamic Risk Assessments for Offensive Cybersecurity Agents

Abstract

Foundation models are becoming increasingly capable autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline, without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.
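To make the role of a verifier under a fixed budget concrete, the following is a minimal sketch of one degree of freedom an adversary might use: resampling attempts until a verifier accepts one. It is an illustration under stated assumptions, not the method evaluated in this work. The names `attempt_until_verified`, `toy_propose`, `toy_verify`, and `success_rate` are hypothetical, the "agent" is a random guesser, and the budget is counted in attempts rather than GPU hours.

```python
import random
from typing import Callable, Optional, Tuple


def attempt_until_verified(
    propose: Callable[[random.Random], str],
    verify: Callable[[str], bool],
    budget: int,
    seed: int = 0,
) -> Tuple[Optional[str], int]:
    """Sample candidates until the verifier accepts one or the budget is spent.

    Returns (accepted_candidate_or_None, number_of_attempts_used).
    """
    rng = random.Random(seed)
    for used in range(1, budget + 1):
        candidate = propose(rng)
        if verify(candidate):
            return candidate, used
    return None, budget


# Toy stand-ins (assumptions for illustration only): the "agent" guesses one of
# ten possible flags, and the verifier checks the guess against a known answer.
SECRET = "flag{7}"


def toy_propose(rng: random.Random) -> str:
    return "flag{%d}" % rng.randrange(10)


def toy_verify(candidate: str) -> bool:
    return candidate == SECRET


def success_rate(budget: int, trials: int = 2000) -> float:
    """Fraction of independent trials solved within the given attempt budget."""
    wins = 0
    for s in range(trials):
        solved, _ = attempt_until_verified(toy_propose, toy_verify, budget, seed=s)
        if solved is not None:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    for b in (1, 5, 10):
        print(b, success_rate(b))
```

Because each trial with a larger budget repeats the same first attempts and then continues, the success rate in this toy setting never decreases as the budget grows. This is why an evaluation that allows a single attempt can understate what an adversary with a verifier and a modest budget could achieve.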