Conditioned Activation Transport for T2I Safety Steering

Abstract

Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. Activation steering is a promising inference-time intervention, but we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset of 2,300 safe/unsafe prompt pairs with high cosine similarity. Using this data, we propose Conditioned Activation Transport (CAT), a framework that combines a geometry-based conditioning mechanism with nonlinear transport maps. By conditioning the transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures, Z-Image and Infinity. Experiments show that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity relative to unsteered generations. Warning: This paper contains potentially offensive text and images.
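The idea of applying a transport map only inside an unsafe activation region can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the conditioning geometry or the form of the transport maps, so the distance-to-centroid gate, the `tanh`-based map, and every name below (`conditioned_steer`, `unsafe_center`, `safe_center`, `radius`, `toy_transport`) are assumptions made purely for illustration.

```python
import numpy as np

def conditioned_steer(h, unsafe_center, radius, transport):
    """Apply `transport` only to rows of `h` that fall inside the gated region.

    ASSUMPTION: the "unsafe region" is modeled as a ball of `radius` around
    `unsafe_center`; the actual conditioning mechanism is not given in the abstract.
    """
    h = np.asarray(h, dtype=float)
    dist = np.linalg.norm(h - unsafe_center, axis=-1)
    gate = dist < radius              # one boolean per activation vector
    out = h.copy()
    if gate.any():
        out[gate] = transport(h[gate])
    return out, gate

# Hypothetical centroids, e.g. estimated from safe/unsafe prompt pairs.
unsafe_center = np.array([2.0, 0.0])
safe_center = np.array([-2.0, 0.0])

def toy_transport(x):
    # ASSUMPTION: a stand-in nonlinear map that moves gated activations
    # near the safe centroid; it is not the map used in the paper.
    return safe_center + np.tanh(x - unsafe_center)

h = np.array([[2.1, 0.0],    # close to the unsafe centroid: steered
              [-1.5, 0.3]])  # benign-looking activation: left untouched
out, gate = conditioned_steer(h, unsafe_center, radius=1.0, transport=toy_transport)
```

In this toy setup the second activation passes through unchanged, which mirrors the stated goal of minimizing interference with benign queries; an unconditional steering step would have shifted both rows.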
