Steerable Visual Representations

Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired object in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
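
To make the early-fusion idea concrete, the following is a minimal, dependency-free sketch of text-to-vision cross-attention: each visual patch token acts as a query over the text prompt tokens, and the attended text context is added back to the patch as a residual. This is an illustration under stated assumptions, not the architecture described in the paper. The function name `cross_attend`, the scalar `gate`, the omission of learned query/key/value projections, and the toy 4-dimensional tokens are all hypothetical choices made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(patch_tokens, text_tokens, gate=1.0):
    """Steer visual patch tokens with text tokens via cross-attention.

    Assumptions (not from the paper): single head, no learned
    projections, and a scalar `gate` scaling the residual update.
    With gate=0.0 the visual tokens pass through unchanged.
    """
    dim = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    steered = []
    for patch in patch_tokens:
        # Query = patch token; keys and values = text tokens.
        weights = softmax([dot(patch, t) * scale for t in text_tokens])
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        # Residual update: original visual feature plus gated text context.
        steered.append([p + gate * c for p, c in zip(patch, context)])
    return steered


# Toy inputs: two patch tokens and two prompt tokens, each 4-dimensional.
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
prompt = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

unchanged = cross_attend(patches, prompt, gate=0.0)
steered = cross_attend(patches, prompt, gate=0.5)
```

In this sketch, a gate of zero returns the original patch tokens, which is one simple way to picture how text injection could leave the underlying visual representation intact when no steering is applied. In a late-fusion model, by contrast, the text would only be compared with the visual features after the encoder has finished, so it could not change which parts of the image those features emphasize.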