VLMs Can Aggregate Scattered Training Patches
June 4, 2025
Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
cs.AI
Abstract
One way to mitigate risks in vision-language models (VLMs) is to remove
dangerous samples in their training data. However, such data moderation can be
easily bypassed when harmful images are split into small, benign-looking
patches, scattered across many training samples. VLMs may then learn to piece
these fragments together during training and generate harmful responses at
inference, either from full images or text references. For instance, if trained
on image patches from a bloody scene paired with the description "safe," VLMs
may later describe the full image or a text reference to the scene as "safe."
We define the core ability of VLMs enabling this attack as visual
stitching -- the ability to integrate visual information spread across
multiple training samples that share the same textual descriptions. In our
work, we first demonstrate visual stitching abilities in common open-source
VLMs on three datasets where each image is labeled with a unique synthetic ID:
we split each (image, ID) pair into {(patch,
ID)} pairs at different granularities for finetuning, and we find that
tuned models can verbalize the correct IDs from full images or text references.
Building on this, we simulate the adversarial data poisoning scenario mentioned
above by using patches from dangerous images and replacing IDs with text
descriptions like "safe" or "unsafe", demonstrating how harmful content can
evade moderation in patches and later be reconstructed through visual
stitching, posing serious VLM safety risks. Code is available at
https://github.com/ZHZisZZ/visual-stitching.
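
To make the data construction concrete, below is a minimal, hypothetical sketch (not the authors' released code; function and variable names are illustrative) of how one (image, ID) pair can be split into {(patch, ID)} pairs at a chosen granularity, with every patch sharing the same textual label:

```python
# Hypothetical sketch: build {(patch, ID)} finetuning pairs from one
# (image, ID) pair by splitting the image into a grid x grid layout.
from PIL import Image


def split_into_patches(image_path: str, image_id: str, grid: int = 2):
    """Return a list of (patch, ID) pairs for one (image, ID) pair."""
    image = Image.open(image_path)
    width, height = image.size
    patch_w, patch_h = width // grid, height // grid

    pairs = []
    for row in range(grid):
        for col in range(grid):
            box = (col * patch_w, row * patch_h,
                   (col + 1) * patch_w, (row + 1) * patch_h)
            patch = image.crop(box)
            # Every patch is paired with the same text; visual stitching is
            # the model's ability to re-associate these patches after
            # finetuning on the scattered (patch, ID) samples.
            pairs.append((patch, image_id))
    return pairs


# Example: a 2x2 split yields 4 (patch, ID) pairs, all sharing one label.
# pairs = split_into_patches("scene.png", "ID_0429", grid=2)
```

In the poisoning variant described above, the synthetic ID would simply be replaced with a text description such as "safe," so that each individual patch looks benign to a data filter while the full image's association is still learned.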