Safe Few-Step Generation via Velocity Editing
June 22, 2026
Authors: Yujin Choi, Jaehong Yoon
cs.AI
Abstract
Flow matching has recently emerged as a strong paradigm for state-of-the-art text-to-image (T2I) generation, enabling high-quality generation with a small number of sampling steps. As these models are increasingly integrated into real-world applications, ensuring safe and non-sensitive content generation has become a critical requirement. However, adapting safety and concept removal methods to this new generation framework remains an open challenge. Specifically, prior methods largely rely on iterative trajectory steering across many denoising steps or on CLIP-centric prompt embedding manipulation. These design assumptions pose fundamental bottlenecks for safety in flow matching-based T2I generation, where limited sampling steps constrain iterative correction and modern context-aware text encoders diminish the effectiveness of embedding-level interventions. In this paper, we propose VESFlow, a training-free safety method tailored to flow matching with extremely few sampling steps. Leveraging the fact that flow matching models learn the marginal velocity, we directly edit the velocity field via a safe-conditional posterior. VESFlow steers the trajectory toward safe outputs while leaving the conditioning prompt unchanged. Building on the observation that VESFlow leaves outputs unchanged under benign prompts, we further introduce a risk score-based filter that bypasses velocity editing for such prompts, reducing computational cost while preserving benign prompt generation. Based on this filter, we propose VESFlow+, a stronger variant of VESFlow that not only edits the velocity toward the safe direction but also pushes it away from the unsafe direction. Experimental results show that, on the 4-step MeanFlow model, VESFlow+ removes the target concept, reducing the attack success rate measured by NudeNet to 6.3% on Ring-A-Bell and 6.8% on MMA-Diffusion, while preserving fidelity on benign prompts.
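The abstract does not state the update rule, so the following is only an illustrative sketch of what a risk-gated velocity edit could look like, not the VESFlow formulation. It assumes a simple guidance-style linear combination in place of the safe-conditional posterior; the function name `edit_velocity`, the parameters `threshold`, `toward`, and `away`, and the three velocity inputs are all hypothetical.

```python
import numpy as np


def edit_velocity(v_prompt, v_safe, v_unsafe, risk,
                  threshold=0.5, toward=1.0, away=1.0):
    """Hypothetical risk-gated velocity edit (not the paper's formula).

    v_prompt : velocity predicted for the user's prompt
    v_safe   : velocity predicted under an assumed "safe" condition
    v_unsafe : velocity predicted under an assumed "unsafe" condition
    risk     : scalar risk score for the prompt (assumed to lie in [0, 1])
    """
    v_prompt = np.asarray(v_prompt, dtype=float)
    if risk < threshold:
        # Low-risk prompt: skip editing, so the output is unchanged
        # and no extra velocity evaluations would be needed.
        return v_prompt
    v_safe = np.asarray(v_safe, dtype=float)
    v_unsafe = np.asarray(v_unsafe, dtype=float)
    # Pull the velocity toward the safe direction ...
    edited = v_prompt + toward * (v_safe - v_prompt)
    # ... and push it away from the unsafe direction.
    edited = edited - away * (v_unsafe - v_prompt)
    return edited


# Toy 2-D velocities, chosen only to make the arithmetic easy to follow.
benign = edit_velocity([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], risk=0.1)
risky = edit_velocity([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], risk=0.9)
print(benign, risky)
```

In this toy setup, setting `away=0.0` gives a toward-only edit and a positive `away` adds the push away from the unsafe direction, loosely mirroring the distinction the abstract draws between VESFlow and VESFlow+. The prompt itself is never modified; only the velocity is.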