Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Abstract

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
