

Building Production-Ready Probes For Gemini

January 16, 2026
Authors: János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
cs.AI

Abstract

Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax aggregation addresses the context-length shift, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at low cost, owing to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
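The abstract does not define the multimax architecture, but a common way such a probe is built is to score every token position with a linear head over residual-stream activations and aggregate several of the highest per-token scores, rather than a single max or a mean that gets diluted on long contexts. The sketch below is a minimal, hypothetical reading of that idea; the class name, the top-k interpretation of "multimax", and all hyperparameters are assumptions, not the paper's specification.

```python
# Hedged sketch of a "multimax"-style activation probe: a linear probe scores
# each token position, and the sequence-level logit averages the top-k
# per-token scores. The top-k reading of "multimax" is an assumption.
import torch
import torch.nn as nn


class MultimaxProbe(nn.Module):
    def __init__(self, d_model: int, k: int = 8):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # per-token linear probe
        self.k = k

    def forward(self, activations: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # activations: (batch, seq_len, d_model) residual-stream activations
        # mask: (batch, seq_len) bool, True at real (non-padding) positions
        scores = self.scorer(activations).squeeze(-1)      # (batch, seq_len)
        scores = scores.masked_fill(~mask, float("-inf"))  # ignore padding
        k = min(self.k, scores.shape[-1])
        topk = scores.topk(k, dim=-1).values               # k highest token scores
        # Average only the finite entries, guarding against -inf when a
        # sequence has fewer than k real tokens.
        finite = topk.isfinite()
        topk = topk.masked_fill(~finite, 0.0)
        return topk.sum(-1) / finite.sum(-1).clamp(min=1)  # (batch,) logits
```

Under these assumptions the probe would be trained with an ordinary binary cross-entropy loss (e.g. `nn.BCEWithLogitsLoss`) on labeled harmful versus benign transcripts; averaging several maxima, rather than taking one, is what keeps a single spurious token from dominating long-context inputs.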
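The pairing of probes with prompted classifiers reads as a two-stage cascade: the probe, which is nearly free because it reuses activations already computed during the model's forward pass, screens all traffic, and only inputs it escalates reach the expensive prompted classifier. A hedged sketch of that control flow, with illustrative names and threshold, follows; the paper's actual routing logic is not specified.

```python
# Hypothetical probe-then-prompted-classifier cascade. `probe` is assumed to
# return a logit (as in the sketch above); `prompted_classifier` stands in for
# an LLM call with a classification prompt. Both names are illustrative.
def cascade_flag(text, activations, mask, probe, prompted_classifier,
                 probe_threshold: float = 0.0) -> bool:
    # Stage 1: cheap activation probe, run on all traffic. The threshold
    # would be tuned for high recall so misuse rarely stops here unflagged.
    if probe(activations, mask).item() < probe_threshold:
        return False
    # Stage 2: expensive prompted classifier, run only on the small fraction
    # of traffic the probe escalates.
    return prompted_classifier(text)
```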