ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Abstract

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. ConceptAttention requires no additional training: it repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings. Our main finding is that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps than commonly used cross-attention mechanisms. ConceptAttention also achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work provides the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
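
The projection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the toy vectors, and the choice to normalise scores with a softmax across concepts are all assumptions made for illustration. It only shows the general shape of the computation, where each image patch's attention-layer output vector is projected (via a dot product) onto each concept's output embedding to obtain one saliency score per concept and patch.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_maps(patch_outputs, concept_outputs):
    """Return maps[k][p]: saliency of concept k at image patch p.

    patch_outputs:   one d-dim vector per image patch (stand-ins for
                     attention-layer outputs of image tokens; hypothetical).
    concept_outputs: one d-dim vector per concept (stand-ins for
                     attention-layer outputs of concept tokens; hypothetical).
    """
    maps = [[0.0] * len(patch_outputs) for _ in concept_outputs]
    for p, patch in enumerate(patch_outputs):
        # Project the patch onto every concept embedding, then normalise
        # across concepts (assumption) so the scores for a patch sum to 1.
        probs = softmax([dot(patch, c) for c in concept_outputs])
        for k, prob in enumerate(probs):
            maps[k][p] = prob
    return maps


# Toy data: 4 patches and 2 concepts in a 3-dimensional output space.
patches = [[2.0, 0.0, 0.0], [1.5, 0.2, 0.0], [0.0, 2.0, 0.0], [0.1, 1.8, 0.0]]
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

maps = saliency_maps(patches, concepts)

# A hard segmentation assigns each patch to its highest-scoring concept.
labels = [
    max(range(len(concepts)), key=lambda k: maps[k][p])
    for p in range(len(patches))
]
```

In a real model the patch and concept vectors would come from the DiT's attention layers rather than hand-written lists, and the resulting per-patch scores would be reshaped to the image's patch grid to form a saliency map.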
