Beyond Scalar Distance: Semantic Attribute Gradients from a Frozen Multimodal Large Language Model for Visual Embeddings

Abstract

Vision encoders for retrieval are typically trained with class-label supervision: each training pair is reduced to a single scalar that uniformly pushes the two embeddings apart or pulls them together, as if all visual attributes either differed or matched at once. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward correct predictions that the MLLM makes from the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attends to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, so deployment cost matches that of a metric-learning baseline. On zero-shot image retrieval on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines.
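
For readers unfamiliar with two quantities named in the abstract, the following is a minimal illustrative sketch, not the SAGA implementation. It assumes the common GRPO convention of normalizing each reward by the mean and standard deviation of its sampled group, and the usual retrieval definition of Recall@1 (a query counts as a hit when its nearest other item shares its class label). The function names, the `eps` constant, the use of cosine similarity, and the toy data are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    `rewards` holds one scalar per sampled response (e.g. 1.0 for a correct
    same-class/different-class prediction, 0.0 otherwise). `eps` is an
    assumed small constant that avoids division by zero when all rewards
    in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(embeddings, labels):
    """Fraction of items whose nearest *other* item has the same label."""
    n = len(embeddings)
    hits = 0
    for i in range(n):
        # Exclude the query itself, then pick the most similar remaining item.
        nearest = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_similarity(embeddings[i], embeddings[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / n


if __name__ == "__main__":
    # Toy group of four sampled predictions: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

    # Toy 2-D embeddings forming two well-separated classes.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(recall_at_1(toy_embeddings, [0, 0, 1, 1]))
```

In this sketch, responses that score above their group's average receive positive advantages and those below receive negative ones, so the learning signal depends only on relative correctness within a group. How that signal reaches the encoder through the frozen MLLM is specific to SAGA and is not modelled here.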