ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Abstract

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
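To make the idea of a linear projection in the attention output space concrete, the following is a minimal sketch and not the authors' implementation. It assumes we already have attention-layer output vectors for the image patches and for each concept, here replaced by random arrays; the function name `saliency_maps`, the softmax over concepts, and all shapes are illustrative assumptions.

```python
import numpy as np

def saliency_maps(image_out, concept_out, grid_hw):
    """Project image-patch outputs onto concept outputs (hypothetical helper).

    image_out:   (num_patches, d) attention outputs for image patches
    concept_out: (num_concepts, d) attention outputs for concept tokens
    grid_hw:     (h, w) patch grid with h * w == num_patches
    Returns an array of shape (num_concepts, h, w).
    """
    h, w = grid_hw
    assert image_out.shape[0] == h * w, "patch count must match the grid"
    # Linear projection: dot product of each patch with each concept vector.
    scores = image_out @ concept_out.T  # (num_patches, num_concepts)
    # Numerically stable softmax across concepts (an assumed normalization).
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # One spatial map per concept.
    return probs.T.reshape(concept_out.shape[0], h, w)

# Stand-in data: a 4x4 patch grid, 3 concepts, 8-dimensional outputs.
rng = np.random.default_rng(0)
image_out = rng.normal(size=(16, 8))
concept_out = rng.normal(size=(3, 8))
maps = saliency_maps(image_out, concept_out, (4, 4))
print(maps.shape)  # (3, 4, 4)
```

In this sketch each patch's scores sum to one across concepts, so taking the argmax over the first axis would give a coarse per-patch concept assignment, which is one plausible way such maps could feed a zero-shot segmentation evaluation.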
