SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Abstract

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
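To make the idea of intervening in an SAE latent space concrete, here is a minimal NumPy sketch of the general pattern: encode a text embedding into sparse latents, rescale a chosen set of latents while leaving a protected set untouched, then decode and renormalize. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify how SEM selects bias-relevant or query-relevant neurons or what modulation rule it applies, so `bias_idx`, `query_idx`, `scale`, and all function names below are hypothetical placeholders. Random weights stand in for a trained SAE and for real CLIP embeddings.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: dense embedding -> non-negative sparse latents."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: latents -> reconstructed dense embedding."""
    return z @ W_dec + b_dec

def modulate_latents(z, bias_idx, query_idx, scale=0.0):
    """Rescale latents flagged as bias-relevant, skipping protected ones.

    bias_idx and query_idx are hypothetical index sets (assumption);
    how they would be chosen is outside the scope of this sketch.
    """
    z_mod = z.copy()
    # Latents that are bias-relevant but not protected as query-relevant.
    target = np.setdiff1d(bias_idx, query_idx).astype(int)
    z_mod[..., target] *= scale
    return z_mod

def debias_embedding(x, W_enc, b_enc, W_dec, b_dec,
                     bias_idx, query_idx, scale=0.0):
    """Encode, modulate, decode, and renormalize to unit length."""
    z = sae_encode(x, W_enc, b_enc)
    z_mod = modulate_latents(z, bias_idx, query_idx, scale)
    x_hat = sae_decode(z_mod, W_dec, b_dec)
    return x_hat / np.linalg.norm(x_hat, axis=-1, keepdims=True)

# Toy usage: random weights in place of a trained SAE (assumption).
rng = np.random.default_rng(0)
d, m = 8, 32  # embedding width and number of SAE latents (toy sizes)
W_enc, b_enc = rng.normal(size=(d, m)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(m, d)), np.zeros(d)

x = rng.normal(size=d)
x /= np.linalg.norm(x)  # CLIP embeddings are typically unit-normalized

x_debiased = debias_embedding(
    x, W_enc, b_enc, W_dec, b_dec,
    bias_idx=[3, 7, 11],  # hypothetical bias-relevant latents
    query_idx=[7],        # hypothetical query-relevant latent to protect
)
```

Because the rescaling is applied after the encoder's ReLU, the edit acts on individual latent units rather than on a fixed direction of the dense embedding. The output is renormalized so it can be compared with image embeddings by cosine similarity in the usual way.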
