Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
September 15, 2025
Authors: Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen
cs.AI
Abstract
Recently, Multimodal Large Language Models (MLLMs) have gained significant
attention across various domains. However, their widespread adoption has also
raised serious safety concerns. In this paper, we uncover a new safety risk of
MLLMs: the output preference of MLLMs can be arbitrarily manipulated by
carefully optimized images. Such attacks often generate contextually relevant
yet biased responses that are neither overtly harmful nor unethical, making
them difficult to detect. Specifically, we introduce a novel method, Preference
Hijacking (Phi), for manipulating MLLM response preferences using a
preference-hijacked image. Our method works at inference time and requires no
model modifications. Additionally, we introduce a universal hijacking
perturbation -- a transferable component that can be embedded into different
images to hijack MLLM responses toward any attacker-specified preferences.
Experimental results across various tasks demonstrate the effectiveness of our
approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
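The abstract does not spell out the optimization procedure, so the following is only a rough sketch of how an inference-time preference-hijacking image of this kind might be produced, not the authors' actual Phi implementation. It assumes white-box gradient access to the victim MLLM and a hypothetical `preference_loss` callable that scores how closely the model's response to a prompt, given the image, matches the attacker's target preference; all names and hyperparameters are placeholders.

```python
# Illustrative sketch (not the Phi codebase): optimize a bounded pixel
# perturbation so that the perturbed image steers an MLLM's responses
# toward an attacker-chosen preference, with the model weights untouched.
import torch

def hijack_image(image, prompts, preference_loss, steps=500, lr=1e-2, eps=16 / 255):
    """Return a hijacked copy of `image`.

    image:            float tensor in [0, 1], shape (C, H, W)
    prompts:          iterable of text prompts used to shape the preference
    preference_loss:  hypothetical callable (image_tensor, prompt) -> scalar
                      tensor, lower when the MLLM's response better matches
                      the attacker's target preference (assumes gradients
                      flow through the victim model's vision encoder)
    """
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        hijacked = torch.clamp(image + delta, 0.0, 1.0)
        # Aggregate the preference objective over all shaping prompts.
        loss = sum(preference_loss(hijacked, p) for p in prompts)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            # Keep the perturbation small so the image stays inconspicuous.
            delta.clamp_(-eps, eps)

    return torch.clamp(image + delta, 0.0, 1.0).detach()
```

Under the same assumptions, the universal hijacking perturbation mentioned in the abstract could be approximated by optimizing a single `delta` over a pool of base images (summing the loss over images as well as prompts), so that the resulting perturbation transfers when embedded into unseen images.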