Beyond Scalar Distance: Semantic Attribute Gradients from a Frozen Multimodal Large Language Model for Visual Embeddings

Abstract

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the two embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM's correct predictions made from the vision encoder's tokens. Since a correct prediction requires those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attends to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, so deployment cost matches that of a metric-learning baseline. On zero-shot image retrieval on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves, SAGA improves Recall@1 by 3 to 6 percentage points over state-of-the-art baselines.
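
To make the group-relative reward concrete, the following is a minimal sketch of how GRPO-style advantages are commonly computed from a binary correctness reward. It is an illustration under stated assumptions, not SAGA's implementation: the function names, the group of four sampled answers, and the 0/1 reward are hypothetical, and the abstract does not specify how the resulting signal is routed to the encoder.

```python
from statistics import fmean, pstdev


def correctness_reward(predicted_same_class, is_same_class):
    """Hypothetical binary reward: 1.0 when the same-class prediction is right."""
    return 1.0 if predicted_same_class == is_same_class else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group.

    This is the usual GRPO-style normalization; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Assumed example: four sampled predictions for one image pair
# whose ground-truth label is "same class".
predictions = [True, False, True, True]
rewards = [correctness_reward(p, True) for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct predictions receive a positive advantage and the incorrect one a negative advantage. A group whose predictions are all correct or all wrong yields zero advantages and therefore contributes no learning signal for that pair.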