MSTS: A Multimodal Safety Test Suite for Vision-Language Models

Abstract

Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that reveal their full unsafe meaning only in combination. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe only because they fail to understand even simple test prompts. We translate MSTS into ten languages and show that non-English prompts increase the rate of unsafe model responses. We also show that models are safer when tested with text-only prompts than with multimodal prompts. Finally, we explore the automation of VLM safety assessments and find even the best safety classifiers to be lacking.
