Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Abstract

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a single scalar that uniformly pushes the two embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM) shown the same pair can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions made from the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout training and discarded at inference, so deployment cost matches that of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves in zero-shot image retrieval.
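To make the GRPO signal concrete, the following is a minimal sketch of the group-relative advantage computation as GRPO is commonly described: several responses are sampled for the same image pair, each is scored, and the scores are normalized within the group. The binary reward, the use of the population standard deviation, the `eps` term, and the function names `pair_reward` and `group_relative_advantages` are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from statistics import mean, pstdev


def pair_reward(predicted_same: bool, truly_same: bool) -> float:
    """Assumed binary reward: 1.0 if the same/different-class prediction is correct."""
    return 1.0 if predicted_same == truly_same else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same pair.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive weight and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled predictions for a pair that truly shares a class.
predictions = [True, False, True, True]
rewards = [pair_reward(p, truly_same=True) for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch the single incorrect prediction receives a negative advantage and the three correct ones a positive advantage. Because the MLLM is frozen, a policy-gradient term weighted by such advantages can only change the vision encoder's tokens, which is the route by which the reward reaches the encoder.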