SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Abstract

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
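The encode, modulate, decode pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the SAE weights are random stand-ins, the embedding `x` is a random vector in place of a real CLIP text embedding, and the names `sae_encode`, `sae_decode`, `modulate`, `bias_mask`, and `query_mask` are hypothetical. How SEM actually scores neurons as bias-relevant or query-relevant is not specified in the abstract, so the masks are simply hard-coded here.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU encoder: produces non-negative, typically sparse latent activations.
    return np.maximum(W_enc @ x + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder: maps latents back to the embedding space.
    return W_dec @ z + b_dec

def modulate(z, bias_mask, query_mask, scale=0.0):
    # Rescale latents flagged as bias-relevant, but leave any latent
    # that is also flagged as query-relevant untouched.
    target = bias_mask & ~query_mask
    z_mod = z.copy()
    z_mod[target] *= scale
    return z_mod

rng = np.random.default_rng(0)
d, k = 8, 32                            # embedding dim, SAE latent dim (toy sizes)
W_enc = rng.normal(size=(k, d))         # random stand-in for trained SAE weights
b_enc = np.zeros(k)
W_dec = rng.normal(size=(d, k)) / k
b_dec = np.zeros(d)

x = rng.normal(size=d)                  # stand-in for a CLIP text embedding
z = sae_encode(x, W_enc, b_enc)

# Hypothetical masks: latents 0-3 are "bias-relevant", latents 2-5 "query-relevant".
bias_mask = np.zeros(k, dtype=bool)
bias_mask[:4] = True
query_mask = np.zeros(k, dtype=bool)
query_mask[2:6] = True

z_mod = modulate(z, bias_mask, query_mask)   # only latents 0 and 1 are suppressed
x_debiased = sae_decode(z_mod, W_dec, b_dec)
x_debiased /= np.linalg.norm(x_debiased)     # renormalize for cosine similarity
```

Because the edit is applied after the ReLU encoder and before decoding, its effect on the embedding is not a single fixed linear projection of `x`, which is one way to read the abstract's "non-linear interventions". The overlap between the two masks shows the preservation rule: latents 2 and 3 are flagged as bias-relevant but kept because they are also query-relevant.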
