SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis

Abstract

Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap: visual and textual features follow distinct distributions, which limits interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but it has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE. This provides a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA and introduce Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it to Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, as evidenced by reconstruction and manipulation tasks such as concept ablation.
