Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
November 10, 2025
Author: Kwanyoung Kim
cs.AI
Abstract
Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Rather than naively corrupting the attention mechanism, ASAG injects an adversarial cost into the self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and improves both conditional and unconditional sample quality. ASAG yields consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
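The abstract's core idea, attention viewed as an optimal-transport plan whose cost can be adversarially perturbed and re-normalized with Sinkhorn iterations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the Gaussian form of the adversarial cost, and all parameter values (`eps`, `n_iters`, `adv_scale`) are illustrative assumptions; only the overall structure (negative query-key similarity as transport cost, perturbed and Sinkhorn-normalized) follows the abstract.

```python
import numpy as np

def sinkhorn_attention(Q, K, V, eps=0.05, n_iters=20, adv_scale=0.0, seed=0):
    """Illustrative Sinkhorn-normalized attention (hypothetical sketch).

    The transport cost is the negative query-key similarity; when
    adv_scale > 0, a random adversarial term perturbs the cost to
    reduce pixel-wise similarity, loosely mirroring the degradation
    ASAG applies inside self-attention layers.
    """
    n, d = Q.shape
    C = -Q @ K.T / np.sqrt(d)        # low cost where query/key are similar
    if adv_scale > 0:                # adversarial perturbation of the cost
        rng = np.random.default_rng(seed)
        C = C + adv_scale * rng.standard_normal(C.shape)
    G = np.exp(-C / eps)             # Gibbs kernel of the regularized OT problem
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iters):         # Sinkhorn scaling toward uniform marginals
        u = 1.0 / (G @ v)
        v = 1.0 / (G.T @ u)
    P = (u[:, None] * G) * v[None, :]          # approximate transport plan
    P = P / P.sum(axis=1, keepdims=True)       # rows as attention distributions
    return P @ V                                # aggregate values with the plan
```

In a guidance setting one would run the sampler twice, once with `adv_scale=0.0` and once with a positive value, and extrapolate between the two predictions as in CFG-style guidance; that combination step is omitted here.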