Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Abstract

Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation, a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
