Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Abstract

Multimodal Large Language Models (MLLMs) have recently gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: their output preferences can be arbitrarily manipulated by carefully optimized images. Such attacks often produce contextually relevant yet biased responses that are neither overtly harmful nor unethical, which makes them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), which manipulates an MLLM's response preferences using a preference-hijacked image. The method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation, a transferable component that can be embedded into different images to hijack MLLM responses toward attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is available at https://github.com/Yifan-Lan/Phi.
