從防禦Gemini對抗間接提示注入中汲取的教訓

摘要

Gemini 正日益被用於代表用戶執行任務，其函數調用與工具使用能力使模型能夠存取用戶數據。然而，某些工具需要存取不受信任的數據，這引入了風險。攻擊者可以在不受信任的數據中嵌入惡意指令，導致模型偏離用戶的預期，並錯誤處理其數據或權限。在本報告中，我們闡述了 Google DeepMind 評估 Gemini 模型對抗性魯棒性的方法，並描述了從這一過程中獲得的主要經驗教訓。我們通過一個對抗性評估框架測試 Gemini 如何應對複雜的攻擊者，該框架部署了一系列自適應攻擊技術，持續針對過去、當前及未來版本的 Gemini 進行測試。我們描述了這些持續的評估如何直接幫助提升 Gemini 對抗操縱的韌性。

English

Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data introducing risk. Adversaries can embed malicious instructions in untrusted data which cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.

從防禦Gemini對抗間接提示注入中汲取的教訓

Lessons from Defending Gemini Against Indirect Prompt Injections

摘要

Support