Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
May 21, 2025
Authors: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
cs.AI
Abstract
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet
most evaluations rely on artificial images. This study asks: How safe are
current VLMs when confronted with meme images that ordinary users share? To
investigate this question, we introduce MemeSafetyBench, a 50,430-instance
benchmark pairing real meme images with both harmful and benign instructions.
Using a comprehensive safety taxonomy and LLM-based instruction generation, we
assess multiple VLMs across single- and multi-turn interactions. We investigate
how real-world memes influence harmful outputs, the mitigating effects of
conversational context, and the relationship between model scale and safety
metrics. Our findings demonstrate that VLMs show greater vulnerability to
meme-based harmful prompts than to synthetic or typographic images. Memes
significantly increase harmful responses and decrease refusals compared to
text-only inputs. Though multi-turn interactions provide partial mitigation,
elevated vulnerability persists. These results highlight the need for
ecologically valid evaluations and stronger safety mechanisms.
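
The abstract does not detail the evaluation pipeline, but a minimal sketch of how a meme-plus-instruction benchmark instance might be scored is shown below. All names here (BenchmarkInstance, query_vlm, is_refusal) are hypothetical placeholders rather than the authors' actual API, and the keyword-based refusal check stands in for the LLM-based judging the paper describes.

# Hypothetical sketch: scoring refusal and harmful-response rates on
# harmful-intent instances of a meme-based safety benchmark.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class BenchmarkInstance:
    image_path: str          # path to the real meme image
    instruction: str         # harmful or benign instruction paired with it
    is_harmful_intent: bool  # label from the safety taxonomy

def is_refusal(response: str) -> bool:
    # Placeholder heuristic; actual evaluations would use an LLM judge.
    return any(p in response.lower() for p in ("i can't", "i cannot", "i won't"))

def evaluate(instances: Iterable[BenchmarkInstance],
             query_vlm: Callable[[str, str], str]) -> dict:
    """Compute refusal and harmful-response rates over harmful-intent instances."""
    harmful_total = refusals = harmful_responses = 0
    for inst in instances:
        response = query_vlm(inst.image_path, inst.instruction)
        if inst.is_harmful_intent:
            harmful_total += 1
            if is_refusal(response):
                refusals += 1
            else:
                # Placeholder: a harmfulness judge would score the response here.
                harmful_responses += 1
    return {
        "refusal_rate": refusals / max(harmful_total, 1),
        "harmful_response_rate": harmful_responses / max(harmful_total, 1),
    }

In this sketch, comparing the two rates between meme-image inputs and text-only inputs would surface the gap the paper reports: more harmful responses and fewer refusals when real memes accompany harmful instructions.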