VLMs Can Aggregate Scattered Training Patches

Abstract

One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples from their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches and scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or from text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs that enables this attack as visual stitching: the ability to integrate visual information spread across multiple training samples that share the same textual description. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID. We split each (image, ID) pair into {(patch, ID)} pairs at different granularities for finetuning, and we find that the tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing the IDs with text descriptions like "safe" or "unsafe". This demonstrates how harmful content can evade moderation at the patch level and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
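The splitting of each (image, ID) pair into {(patch, ID)} pairs can be illustrated with a minimal sketch. It assumes a toy grayscale image stored as nested lists, a square grid split controlled by a hypothetical `factor` parameter, and a made-up ID string; the function names are illustrative and are not taken from the released code.

```python
from typing import List, Tuple

# Assumption: a toy grayscale image represented as rows of pixel values.
Image = List[List[int]]


def split_into_patches(image: Image, factor: int) -> List[Image]:
    """Split an image into a factor x factor grid of equal-sized patches."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the split factor")
    patch_h, patch_w = height // factor, width // factor
    patches = []
    for r in range(factor):          # grid row, top to bottom
        for c in range(factor):      # grid column, left to right
            rows = image[r * patch_h:(r + 1) * patch_h]
            patches.append([row[c * patch_w:(c + 1) * patch_w] for row in rows])
    return patches


def make_patch_samples(image: Image, label: str, factor: int) -> List[Tuple[Image, str]]:
    """Pair every patch with the same text label, giving {(patch, ID)} samples."""
    return [(patch, label) for patch in split_into_patches(image, factor)]


# Hypothetical 4x4 image and ID, split at the coarsest non-trivial granularity.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
samples = make_patch_samples(image, "ID-0001", factor=2)
```

A larger `factor` yields smaller patches, each carrying less of the original image, which corresponds to the finer granularities mentioned in the abstract. In the poisoning scenario, the label would be a description such as "safe" rather than a synthetic ID.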
