Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

Abstract

We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and we design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies, while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project page: https://abdo-eldesokey.github.io/mind-the-glitch/
