Lessons from Defending Gemini Against Indirect Prompt Injections

Abstract

Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data, which introduces risk. Adversaries can embed malicious instructions in untrusted data that cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques that run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.
