VLMs Can Aggregate Scattered Training Patches

Abstract

One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples from their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or from text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs that enables this attack as visual stitching: the ability to integrate visual information spread across multiple training samples that share the same textual description. In this work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID. We split each (image, ID) pair into {(patch, ID)} pairs at different granularities for finetuning, and we find that the tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario described above by using patches from dangerous images and replacing IDs with text descriptions such as "safe" or "unsafe," demonstrating how harmful content can evade moderation in patch form and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
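
The patch-splitting setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_into_patches`, the nested-list stand-in for an image, and the `grid` parameter (number of splits per side, used here as a proxy for granularity) are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical stand-in for an image: a 2D grid of pixel values.
Image = List[List[int]]


def split_into_patches(image: Image, image_id: str, grid: int) -> List[Tuple[Image, str]]:
    """Split one (image, ID) pair into grid * grid (patch, ID) pairs.

    `grid` is the number of splits per side (an assumed way to express
    granularity). Every patch inherits the label of the full image.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by grid")
    patch_h, patch_w = height // grid, width // grid

    pairs = []
    for r in range(grid):
        for c in range(grid):
            # Take the rows of this patch, then slice out its columns.
            rows = image[r * patch_h:(r + 1) * patch_h]
            patch = [row[c * patch_w:(c + 1) * patch_w] for row in rows]
            pairs.append((patch, image_id))
    return pairs


# Toy 4x4 "image" labeled with a synthetic ID, split into a 2x2 grid.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
pairs = split_into_patches(toy_image, "ID-0001", grid=2)
```

No single training sample in `pairs` contains the full image, yet all of them share one label. In the poisoning scenario, the same sketch would pair each patch with a text description such as "safe" instead of a synthetic ID.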
