ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
February 6, 2025
Authors: Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau
cs.AI
Abstract
Do the rich representations of multi-modal diffusion transformers (DiTs)
exhibit unique properties that enhance their interpretability? We introduce
ConceptAttention, a novel method that leverages the expressive power of DiT
attention layers to generate high-quality saliency maps that precisely locate
textual concepts within images. Without requiring additional training,
ConceptAttention repurposes the parameters of DiT attention layers to produce
highly contextualized concept embeddings, contributing the major discovery that
performing linear projections in the output space of DiT attention layers
yields significantly sharper saliency maps compared to commonly used
cross-attention mechanisms. Remarkably, ConceptAttention even achieves
state-of-the-art performance on zero-shot image segmentation benchmarks,
outperforming 11 other zero-shot interpretability methods on the
ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our
work contributes the first evidence that the representations of multi-modal DiT
models like Flux are highly transferable to vision tasks like segmentation,
even outperforming multi-modal foundation models like CLIP.
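The abstract describes saliency maps obtained by linear projection in the output space of DiT attention layers, but it gives no implementation details. The following is only a toy sketch of that general idea, not the authors' implementation. It assumes each image patch and each concept already has an attention-output vector, that the projection is a plain dot product, and that scores are normalised with a softmax across concepts. The names `saliency_maps`, `patch_outputs`, and `concept_outputs` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_maps(patch_outputs, concept_outputs, height, width):
    """Return one height x width map per concept.

    Assumption: both inputs are lists of equal-length vectors standing in
    for attention-layer outputs; a real DiT would supply these.
    """
    if len(patch_outputs) != height * width:
        raise ValueError("expected height * width patch vectors")

    # Project every patch vector onto every concept vector (dot product),
    # then normalise the scores across concepts for that patch.
    per_patch = []
    for patch in patch_outputs:
        scores = [
            sum(p * c for p, c in zip(patch, concept))
            for concept in concept_outputs
        ]
        per_patch.append(softmax(scores))

    # Regroup the per-patch scores into one 2D grid per concept.
    maps = []
    for k in range(len(concept_outputs)):
        flat = [row[k] for row in per_patch]
        maps.append([flat[r * width:(r + 1) * width] for r in range(height)])
    return maps


# Toy data (made up): a 2 x 2 grid of patches and two concepts.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
concepts = [[2.0, 0.0], [0.0, 2.0]]
maps = saliency_maps(patches, concepts, height=2, width=2)
```

In this toy setup the first concept scores highest on the patches whose vectors align with it, which is the behaviour a saliency map is meant to show. Which layers to read from and how to aggregate across layers or timesteps are left open here because the abstract does not specify them.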