視覺語言模型在現實世界中安全嗎?基於迷因的基準測試研究
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
May 21, 2025
作者: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
cs.AI
摘要
視覺語言模型(VLMs)的快速部署放大了安全風險,然而大多數評估仍依賴於人工生成的圖像。本研究提出一個關鍵問題:當面對普通用戶分享的迷因圖像時,現有的VLMs安全性如何?為探討此問題,我們引入了MemeSafetyBench,這是一個包含50,430個實例的基準測試,將真實的迷因圖像與有害及無害的指令配對。通過全面的安全分類法和基於大型語言模型(LLM)的指令生成,我們評估了多個VLMs在單輪和多輪互動中的表現。我們研究了現實世界中的迷因如何影響有害輸出、對話語境的緩解效果,以及模型規模與安全指標之間的關係。研究結果表明,與合成或排版圖像相比,VLMs對基於迷因的有害提示表現出更大的脆弱性。相較於純文本輸入,迷因顯著增加了有害回應並降低了拒絕率。儘管多輪互動提供了一定程度的緩解,但高脆弱性依然存在。這些結果強調了生態效度評估和更強安全機制的必要性。
English
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet
most evaluations rely on artificial images. This study asks: How safe are
current VLMs when confronted with meme images that ordinary users share? To
investigate this question, we introduce MemeSafetyBench, a 50,430-instance
benchmark pairing real meme images with both harmful and benign instructions.
Using a comprehensive safety taxonomy and LLM-based instruction generation, we
assess multiple VLMs across single and multi-turn interactions. We investigate
how real-world memes influence harmful outputs, the mitigating effects of
conversational context, and the relationship between model scale and safety
metrics. Our findings demonstrate that VLMs show greater vulnerability to
meme-based harmful prompts than to synthetic or typographic images. Memes
significantly increase harmful responses and decrease refusals compared to
text-only inputs. Though multi-turn interactions provide partial mitigation,
elevated vulnerability persists. These results highlight the need for
ecologically valid evaluations and stronger safety mechanisms.Summary
AI-Generated Summary