Safe Few-Step Generation via Velocity Editing

Abstract

Flow matching has recently emerged as a strong paradigm for state-of-the-art text-to-image (T2I) generation, enabling high-quality generation with a small number of sampling steps. As these models are increasingly integrated into real-world applications, ensuring safe and non-sensitive content generation has become a critical requirement. However, adapting safety and concept removal methods to this new generation framework remains an open challenge. Specifically, prior methods largely rely on iterative trajectory steering across many denoising steps or on CLIP-centric prompt embedding manipulation. These design assumptions become fundamental bottlenecks for safety in flow matching-based T2I generation, where the limited number of sampling steps constrains iterative correction and modern context-aware text encoders diminish the effectiveness of embedding-level interventions. In this paper, we propose VESFlow, a training-free safety method tailored to flow matching with extremely few sampling steps. Leveraging the fact that flow matching models learn the marginal velocity, we directly edit the velocity field via a safe-conditional posterior. VESFlow steers the trajectory toward safe outputs while leaving the conditioning prompt unchanged. Building on the observation that VESFlow leaves outputs unchanged under benign prompts, we further introduce risk score-based filtering that bypasses velocity editing for such prompts, reducing computational cost while preserving benign prompt generation. Based on this filtering, we propose VESFlow+, a stronger variant of VESFlow that not only edits the velocity toward the safe direction but also pushes it away from the unsafe direction. Experimental results show that VESFlow+ removes the target concept, reducing the attack success rate measured by NudeNet to 6.3% on Ring-A-Bell and 6.8% on MMA-Diffusion on the 4-step MeanFlow model, while preserving fidelity on benign prompts.
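The abstract does not state the update rule, so the following is only a minimal sketch of how a velocity edit with a risk gate might look. It assumes a guidance-style linear combination of velocities, a scalar risk score with a fixed threshold, and Euler integration from t=0 to t=1. All names (`edit_velocity`, `toward`, `sample`, `w_safe`, `w_unsafe`, `threshold`) are hypothetical, the toy velocity fields stand in for a real flow matching model, and none of this should be read as the VESFlow implementation.

```python
import numpy as np


def toward(target):
    """Toy velocity field (assumption): straight-line flow that reaches `target` at t=1."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)


def edit_velocity(v_prompt, v_safe, v_unsafe=None, risk=1.0,
                  threshold=0.5, w_safe=1.0, w_unsafe=0.5):
    """Hypothetical velocity edit; the combination rule is an assumption.

    Below the risk threshold the prompt velocity is returned untouched,
    mirroring the idea of bypassing the edit for benign prompts.
    """
    if risk < threshold:
        return v_prompt
    # Move the velocity toward the safe direction.
    edited = v_prompt + w_safe * (v_safe - v_prompt)
    # Optionally also push it away from the unsafe direction.
    if v_unsafe is not None:
        edited = edited - w_unsafe * (v_unsafe - v_prompt)
    return edited


def sample(velocity_fn, x, num_steps=4):
    """Few-step Euler integration of dx/dt = velocity_fn(x, t) on [0, 1]."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity_fn(x, t0)
    return x


# Toy targets (assumptions): where each conditioning would send the sample.
v_prompt_fn = toward([1.0, 0.0])
v_safe_fn = toward([0.0, 0.0])
v_unsafe_fn = toward([1.0, 1.0])


def edited_field(risk, use_unsafe=False):
    """Build a velocity function that applies the edit at every step."""
    def fn(x, t):
        v_u = v_unsafe_fn(x, t) if use_unsafe else None
        return edit_velocity(v_prompt_fn(x, t), v_safe_fn(x, t), v_u, risk=risk)
    return fn


x_init = np.array([0.3, -0.2])
benign = sample(edited_field(risk=0.1), x_init)   # edit bypassed
risky = sample(edited_field(risk=0.9), x_init)    # steered toward safe target
risky_plus = sample(edited_field(risk=0.9, use_unsafe=True), x_init)
```

In this toy setting the low-risk run follows the prompt velocity unchanged, while the high-risk runs are redirected within the same four steps. A real system would obtain the three velocities from model evaluations and the risk score from some classifier, neither of which is specified here.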