

Steerable Visual Representations

April 2, 2026
Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
cs.AI

Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
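
To make the early-fusion idea more concrete, the sketch below shows one way lightweight text-to-image cross-attention could be interleaved with pretrained ViT blocks, so that patch tokens are "steered" by a text prompt inside the encoder rather than after it. This is a minimal, hypothetical PyTorch illustration based only on the abstract: the module names (`TextCrossAttention`, `SteeredViTBlock`), the residual update, and all hyperparameters are assumptions, not the authors' implementation.

```python
# Minimal sketch of early fusion via lightweight cross-attention (assumption:
# not the authors' code). Visual tokens act as queries; text-prompt embeddings
# act as keys/values, and the result is added back residually so the frozen
# backbone's features are steered rather than replaced.
import torch
import torch.nn as nn


class TextCrossAttention(nn.Module):
    """Lets visual tokens attend to text-prompt embeddings (hypothetical module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        steered, _ = self.attn(self.norm(visual_tokens), text_tokens, text_tokens)
        return visual_tokens + steered  # residual: preserve underlying representation


class SteeredViTBlock(nn.Module):
    """Wraps a pretrained ViT block (e.g., from DINOv2) and injects text after it."""

    def __init__(self, vit_block: nn.Module, dim: int):
        super().__init__()
        self.vit_block = vit_block            # frozen pretrained transformer block
        self.cross_attn = TextCrossAttention(dim)  # the only newly trained part

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vit_block(visual_tokens)
        return self.cross_attn(visual_tokens, text_tokens)


# Toy usage: 197 patch tokens of dim 768 steered by a 16-token prompt embedding.
if __name__ == "__main__":
    block = SteeredViTBlock(nn.Identity(), dim=768)  # Identity stands in for a real ViT block
    vis = torch.randn(2, 197, 768)
    txt = torch.randn(2, 16, 768)
    print(block(vis, txt).shape)  # torch.Size([2, 197, 768])
```

In this reading of the abstract, only the cross-attention adapters would need training, which is what makes the fusion "lightweight" while the pretrained visual backbone stays intact.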