Lessons from Defending Gemini Against Indirect Prompt Injections
May 20, 2025
Authors: Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, John "Four" Flynn
cs.AI
Abstract
Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data, introducing risk. Adversaries can embed malicious instructions in untrusted data, which cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.