

Building Production-Ready Probes For Gemini

January 16, 2026
Authors: János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
cs.AI

Abstract

Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while the multimax architecture addresses the context-length shift, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
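The abstract does not spell out the probe architectures themselves. As a rough illustration of the underlying idea only, the sketch below scores each token's activation with a linear head and aggregates with a top-k mean over positions, so a short harmful span inside a long benign context can still drive the sequence-level score. The top-k mean is an assumed stand-in for the "multimax" aggregation the paper refers to; the layer choice, head shape, and value of k are likewise assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn


class MultiMaxProbe(nn.Module):
    """Illustrative linear activation probe with top-k aggregation over tokens.

    This is a minimal sketch, not the architecture from the paper: per-token
    scores from a linear head are pooled with the mean of the top-k scores,
    which keeps the probe sensitive to localized misuse signals in long contexts.
    """

    def __init__(self, d_model: int, k: int = 8):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # scores one residual-stream vector per token
        self.k = k

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, seq_len, d_model) activations cached from some chosen layer.
        token_scores = self.score(acts).squeeze(-1)        # (batch, seq_len)
        k = min(self.k, token_scores.shape[-1])
        top_scores = token_scores.topk(k, dim=-1).values   # (batch, k)
        return torch.sigmoid(top_scores.mean(dim=-1))      # (batch,) misuse probability


# Hypothetical usage on pre-extracted activations:
# probe = MultiMaxProbe(d_model=4096)
# p_misuse = probe(activations)  # activations: (batch, seq_len, 4096)
```

The cost argument for pairing probes with prompted classifiers suggests a cascade-style deployment, where the cheap probe scores all traffic and only borderline cases are escalated to the more expensive prompted classifier; the paper's exact routing rule is not described in the abstract.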