Conditioned Activation Transport for Text-to-Image Safety Steering

Abstract

Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset of 2,300 safe/unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that combines a geometry-based conditioning mechanism with nonlinear transport maps. By restricting the transport maps to act only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures, Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while preserving the image fidelity of unsteered generations. Warning: This paper contains potentially offensive text and images.
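
To make the contrast between unconditional linear steering and conditioned transport concrete, the following is a minimal toy sketch in plain Python. It is not the CAT implementation: the spherical gate (`in_unsafe_region`), the distance-weighted map (`toy_transport`), the centroids, and the radius are all hypothetical stand-ins chosen for illustration, since the abstract does not specify the form of the conditioning mechanism or of the transport maps.

```python
import math

def dist(a, b):
    """Euclidean distance between two activation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linear_steer(h, direction, alpha):
    """Unconditional linear steering: every activation is shifted by -alpha * direction."""
    return [x - alpha * d for x, d in zip(h, direction)]

def in_unsafe_region(h, unsafe_center, radius):
    """Hypothetical geometric gate: h counts as unsafe if it lies within `radius` of the unsafe centroid."""
    return dist(h, unsafe_center) <= radius

def toy_transport(h, unsafe_center, safe_center, radius):
    """Hypothetical nonlinear map: push h toward the safe centroid, with a
    weight that decays with distance from the unsafe centroid."""
    weight = math.exp(-dist(h, unsafe_center) ** 2 / radius ** 2)
    return [x + weight * (s - u) for x, u, s in zip(h, unsafe_center, safe_center)]

def conditioned_steer(h, unsafe_center, safe_center, radius):
    """Apply the transport map only inside the unsafe region; otherwise return h unchanged."""
    if in_unsafe_region(h, unsafe_center, radius):
        return toy_transport(h, unsafe_center, safe_center, radius)
    return list(h)

# Illustrative 2-D values (assumptions, not taken from the paper).
unsafe_center, safe_center, radius = [1.0, 0.0], [0.0, 1.0], 0.5
direction = [u - s for u, s in zip(unsafe_center, safe_center)]

benign_h = [0.0, 1.2]   # far from the unsafe centroid
unsafe_h = [0.9, 0.1]   # close to the unsafe centroid

print("linear on benign:     ", linear_steer(benign_h, direction, 1.0))
print("conditioned on benign:", conditioned_steer(benign_h, unsafe_center, safe_center, radius))
print("conditioned on unsafe:", conditioned_steer(unsafe_h, unsafe_center, safe_center, radius))
```

In this toy setting the linear shift moves the benign activation even though no intervention is needed, whereas the gated variant leaves it untouched and only transports the activation that falls inside the assumed unsafe region.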