MSTS：一個用於視覺語言模型的多模式安全測試套件

摘要

視覺語言模型（VLMs）處理圖像和文本輸入，逐漸整合到聊天助手和其他消費者人工智慧應用中。然而，若缺乏適當的保障措施，VLMs可能提供有害建議（例如如何自我傷害）或鼓勵不安全行為（例如使用毒品）。儘管存在明顯的危險，迄今為止很少有研究評估VLM的安全性以及多模態輸入帶來的新風險。為填補這一空白，我們引入MSTS，一個針對VLMs的多模態安全測試套件。MSTS包含40個細粒度危害類別中的400個測試提示。每個測試提示由文本和圖像組成，只有結合在一起才能揭示完整的不安全含義。通過MSTS，我們發現幾個開放式VLM中存在明顯的安全問題。我們還發現一些VLM之所以安全，純屬意外，因為它們甚至無法理解簡單的測試提示。我們將MSTS翻譯成十種語言，展示非英語提示以提高不安全模型回應的比例。我們還展示，與多模態提示相比，僅使用文本進行測試時，模型更安全。最後，我們探索VLM安全評估的自動化，發現即使是最佳的安全分類器也存在不足。

English

Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.

MSTS：一個用於視覺語言模型的多模式安全測試套件

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

摘要

Support