Safe Few-Step Generation via Velocity Editing

Abstract

Flow matching has recently emerged as a state-of-the-art paradigm for text-to-image (T2I) generation, enabling high-quality generation with a small number of sampling steps. As these models are increasingly integrated into real-world applications, ensuring safe, non-sensitive content generation has become a critical requirement. However, adapting safety and concept-removal methods to this new generation framework remains an open challenge. Specifically, prior methods rely largely on iterative trajectory steering across many denoising steps or on CLIP-centric manipulation of prompt embeddings. These design assumptions create fundamental bottlenecks for safety in flow-matching-based T2I generation: a limited number of sampling steps constrains iterative correction, and modern context-aware text encoders diminish the effectiveness of embedding-level interventions. In this paper, we propose VESFlow, a training-free safety method tailored to flow matching with extremely few sampling steps. Leveraging the fact that flow matching models learn the marginal velocity, we directly edit the velocity field via a safe-conditional posterior. VESFlow steers the trajectory toward safe outputs while leaving the conditioning prompt unchanged. Building on the observation that VESFlow leaves outputs unchanged under benign prompts, we further introduce risk-score-based filtering that bypasses velocity editing, reducing computational cost while preserving generation for benign prompts. Based on this filtering, we propose VESFlow+, a stronger variant of VESFlow that not only edits the velocity toward the safe direction but also pushes it away from the unsafe direction. Experimental results show that VESFlow+ removes the target concept, reducing the attack success rate measured by NudeNet to 6.3% on Ring-A-Bell and 6.8% on MMA-Diffusion with a 4-step MeanFlow model, while preserving fidelity on benign prompts.
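To make the control flow described in the abstract concrete, the following is a minimal toy sketch, not the VESFlow implementation. Everything in it is an assumption for illustration: `toy_velocity` stands in for a learned velocity field, the linear guidance-style combination in `edit_velocity` is a swapped-in substitute for the safe-conditional posterior (whose exact form the abstract does not give), and `risk`, `threshold`, `w_safe`, and `w_unsafe` are hypothetical parameters. It shows only the three ideas named above: few-step sampling, gating the edit on a risk score, and optionally pushing away from an unsafe direction.

```python
from typing import List

Vec = List[float]


def toy_velocity(x: Vec, t: float, target: Vec) -> Vec:
    """Hypothetical stand-in for a learned velocity field.

    For a straight path from x to `target`, the velocity at time t is
    (target - x) / (1 - t); the clamp avoids division by zero near t = 1.
    """
    scale = 1.0 / max(1.0 - t, 1e-3)
    return [(g - xi) * scale for xi, g in zip(x, target)]


def edit_velocity(v_cond: Vec, v_safe: Vec, v_unsafe: Vec,
                  w_safe: float = 1.0, w_unsafe: float = 0.0) -> Vec:
    """Assumed guidance-style edit (not the paper's posterior formula).

    Moves the prompt-conditioned velocity toward the safe velocity and,
    when w_unsafe > 0, away from the unsafe velocity.
    """
    return [c + w_safe * (s - c) - w_unsafe * (u - c)
            for c, s, u in zip(v_cond, v_safe, v_unsafe)]


def sample(x0: Vec, prompt_target: Vec, safe_target: Vec, unsafe_target: Vec,
           risk: float, threshold: float = 0.5, steps: int = 4,
           w_safe: float = 1.0, w_unsafe: float = 0.0) -> Vec:
    """Few-step Euler sampler with risk-gated velocity editing."""
    x = list(x0)
    dt = 1.0 / steps
    apply_edit = risk >= threshold  # low-risk prompts skip editing entirely
    for i in range(steps):
        t = i * dt
        v = toy_velocity(x, t, prompt_target)
        if apply_edit:
            v = edit_velocity(v,
                              toy_velocity(x, t, safe_target),
                              toy_velocity(x, t, unsafe_target),
                              w_safe, w_unsafe)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this toy setting, a prompt whose risk falls below the threshold follows the unedited trajectory to `prompt_target`, while a high-risk prompt with `w_safe=1.0` is steered to `safe_target` within the same four steps. Setting `w_unsafe > 0` is a rough analogue of the additional repulsion described for VESFlow+.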