Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Abstract

The rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: how safe are current VLMs when confronted with meme images that ordinary users share? To investigate, we introduce MemeSafetyBench, a 50,430-instance benchmark that pairs real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single-turn and multi-turn interactions. We examine how real-world memes influence harmful outputs, how conversational context mitigates them, and how model scale relates to safety metrics. We find that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Compared with text-only inputs, memes significantly increase harmful responses and decrease refusals. Multi-turn interactions provide partial mitigation, but elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.

