Interfacing Foundation Models' Embeddings
December 12, 2023
Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang
cs.AI
Abstract
We present FIND, a generalized interface for aligning foundation models'
embeddings. As shown in the teaser figure, a lightweight transformer interface,
without tuning any foundation-model weights, is sufficient for unified
image-level (segmentation) and dataset-level (retrieval) understanding. The
proposed interface has the following favorable attributes: (1) Generalizable.
It applies to various tasks spanning retrieval, segmentation, etc., under the
same architecture and weights. (2) Prototypable. Different tasks can be
implemented by prototyping attention masks and embedding types. (3)
Extendable. The proposed interface adapts to new tasks and new models.
(4) Interleavable. Benefiting from multi-task, multi-modal training, the
proposed interface creates an interleaved shared embedding space. Building on
this interleaved embedding space, we introduce FIND-Bench, which adds new
training and evaluation annotations to the COCO dataset for interleaved
segmentation and retrieval. Our approach achieves state-of-the-art performance
on FIND-Bench and competitive performance in standard retrieval and
segmentation settings. The training, evaluation, and demo code, as well as the
dataset, have been released at https://github.com/UX-Decoder/FIND.
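To make the "prototypable" idea concrete, below is a minimal sketch of the general pattern the abstract describes: a lightweight transformer interface whose learnable task queries cross-attend to frozen foundation-model embeddings, with an attention mask selecting what each task may see. This is not the released FIND implementation; the module names, shapes, query counts, and mask layout are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' code): a lightweight
# transformer interface over frozen foundation-model embeddings, where
# tasks are prototyped via attention masks.
import torch
import torch.nn as nn

class InterfaceLayer(nn.Module):
    """One block of a lightweight interface over frozen embeddings."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, memory, attn_mask=None):
        # Task queries cross-attend to the frozen vision/language
        # embeddings; attn_mask "prototypes" which embeddings each
        # query is allowed to see (True = blocked position).
        x, _ = self.attn(queries, memory, memory, attn_mask=attn_mask)
        queries = self.norm1(queries + x)
        return self.norm2(queries + self.ffn(queries))

# Frozen foundation-model embeddings (hypothetical shapes):
vision = torch.randn(2, 196, 512)  # e.g. ViT patch embeddings
text = torch.randn(2, 16, 512)     # e.g. LLM token embeddings
memory = torch.cat([vision, text], dim=1)

# Learnable task queries: say, 100 segmentation proposals + 1 retrieval token.
queries = torch.randn(2, 101, 512, requires_grad=True)

# Prototype tasks through the attention mask (True = masked out):
L, S = queries.shape[1], memory.shape[1]
mask = torch.zeros(L, S, dtype=torch.bool)
mask[:100, 196:] = True  # segmentation queries attend to vision only
mask[100:, :196] = True  # the retrieval query attends to text only

layer = InterfaceLayer()
out = layer(queries, memory, attn_mask=mask)  # only interface weights train
print(out.shape)  # torch.Size([2, 101, 512])
```

Under this reading, switching tasks amounts to changing the mask and the query/embedding types rather than the weights, which is what allows one architecture and one set of weights to serve both segmentation and retrieval.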