Interfacing Foundation Models' Embeddings

Abstract

We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in the teaser figure, a lightweight transformer interface, trained without tuning any foundation model weights, is enough for unified image-level (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types. (3) Extendable. The proposed interface is adaptable to new tasks and new models. (4) Interleavable. With the benefit of multi-task, multi-modal training, the proposed interface creates an interleaved shared embedding space. In light of the interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code, as well as the dataset, have been released at https://github.com/UX-Decoder/FIND.
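The "Prototypable" attribute can be illustrated with a minimal sketch of how a task might be expressed as an attention mask over typed tokens. This is not the released FIND implementation: the token types (`image`, `text`, `seg_query`, `ret_query`), the `allowed` rules, and both helper functions are hypothetical names chosen for illustration only.

```python
import math

def build_attention_mask(token_types, allowed):
    """Return mask[i][j] = True when token i may attend to token j.

    `allowed` is a set of (query_type, key_type) pairs; every other
    pairing is blocked.
    """
    return [[(q, k) in allowed for k in token_types] for q in token_types]

def masked_softmax(scores, mask_row):
    """Softmax over positions where mask_row is True; blocked positions get 0."""
    visible = [s for s, m in zip(scores, mask_row) if m]
    if not visible:
        return [0.0] * len(scores)
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical token layout: two image embeddings and one text embedding
# (standing in for frozen foundation-model outputs), plus one learnable
# query per task.
token_types = ["image", "image", "text", "seg_query", "ret_query"]

# Hypothetical task rules: the segmentation query reads image and text
# tokens, the retrieval query reads image tokens only.
allowed = {
    ("seg_query", "image"), ("seg_query", "text"), ("seg_query", "seg_query"),
    ("ret_query", "image"), ("ret_query", "ret_query"),
}

mask = build_attention_mask(token_types, allowed)

# Attention weights for the retrieval query (row 4) given some raw scores.
weights = masked_softmax([0.5, 1.5, 0.2, 0.0, 0.0], mask[4])
```

Under these assumptions, adding a task amounts to adding a token type and a few entries to `allowed`, while the attention computation itself stays unchanged.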
