VLMs Can Aggregate Scattered Training Patches
June 4, 2025
Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
cs.AI
Abstract
One way to mitigate risks in vision-language models (VLMs) is to remove
dangerous samples in their training data. However, such data moderation can be
easily bypassed when harmful images are split into small, benign-looking
patches, scattered across many training samples. VLMs may then learn to piece
these fragments together during training and generate harmful responses at
inference, either from full images or text references. For instance, if trained
on image patches from a bloody scene, each paired with the description "safe,"
VLMs may later describe the full image, or a text reference to the scene, as "safe."
We define the core ability of VLMs enabling this attack as visual
stitching -- the ability to integrate visual information spread across
multiple training samples that share the same textual descriptions. In our
work, we first demonstrate visual stitching abilities in common open-source
VLMs on three datasets where each image is labeled with a unique synthetic ID:
we split each (image, ID) pair into {(patch,
ID)} pairs at different granularities for finetuning, and we find that
tuned models can verbalize the correct IDs from full images or text references.
Building on this, we simulate the adversarial data poisoning scenario mentioned
above by using patches from dangerous images and replacing IDs with text
descriptions like "safe" or "unsafe", demonstrating how harmful content can
evade moderation in patches and later be reconstructed through visual
stitching, posing serious VLM safety risks. Code is available at
https://github.com/ZHZisZZ/visual-stitching.
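
To make the data-construction step concrete, the sketch below shows one way a single (image, ID) pair could be split into a grid of (patch, ID) pairs that all share the same text, which is the kind of scattered finetuning data the abstract describes. It is a minimal illustration, not the released implementation: the function name, grid granularity, and sample format are assumptions; see the repository above for the actual code.

```python
# Illustrative sketch (assumed, not the authors' code): turn one (image, ID)
# pair into {(patch, ID)} finetuning samples at a chosen grid granularity.
from PIL import Image

def split_into_patch_pairs(image_path: str, image_id: str, grid: int = 4):
    """Split an image into a grid x grid set of patches, each paired with the same ID."""
    image = Image.open(image_path)
    width, height = image.size
    patch_w, patch_h = width // grid, height // grid
    pairs = []
    for row in range(grid):
        for col in range(grid):
            box = (col * patch_w, row * patch_h,
                   (col + 1) * patch_w, (row + 1) * patch_h)
            patch = image.crop(box)
            # Every patch carries the full image's textual label, so no single
            # training sample shows the whole image to a data moderator.
            pairs.append({"image": patch, "text": image_id})
    return pairs

# Example: one (image, ID) pair becomes 16 (patch, ID) samples at 4x4 granularity.
# samples = split_into_patch_pairs("scene.png", "ID-7342", grid=4)
```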