Interfacing Foundation Models' Embeddings

December 12, 2023
Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang
cs.AI

Abstract

We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in the teaser figure, a lightweight transformer interface, without tuning any foundation model weights, is enough for unified image-level (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types. (3) Extendable. The proposed interface adapts to new tasks and new models. (4) Interleavable. With the benefit of multi-task, multi-modal training, the proposed interface creates an interleaved shared embedding space. Building on the interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code, as well as the dataset, have been released at https://github.com/UX-Decoder/FIND.
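To make the described architecture concrete, below is a minimal PyTorch sketch of the core idea: a lightweight transformer interface that projects frozen foundation-model embeddings into a shared space and attends over them, with task behavior "prototyped" by swapping the attention mask. This is not the authors' implementation (see the linked repository for that); all class names, dimensions, and the masking scheme here are illustrative assumptions.

```python
# A minimal sketch, assuming frozen vision/text encoders whose outputs are
# passed in as tensors. Names, dimensions, and masking are hypothetical.
import torch
import torch.nn as nn


class InterfaceLayer(nn.Module):
    """One self-attention block over concatenated vision/text tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # attn_mask encodes which token groups may attend to which others;
        # changing the mask (not the weights) is what retargets the task.
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=attn_mask)
        tokens = self.norm1(tokens + out)
        return self.norm2(tokens + self.ffn(tokens))


class FINDStyleInterface(nn.Module):
    """Aligns frozen embeddings in a shared space; the encoders stay untouched."""

    def __init__(self, vision_dim: int, text_dim: int, dim: int = 512, depth: int = 3):
        super().__init__()
        self.proj_v = nn.Linear(vision_dim, dim)  # frozen vision features -> shared dim
        self.proj_t = nn.Linear(text_dim, dim)    # frozen language features -> shared dim
        self.layers = nn.ModuleList([InterfaceLayer(dim) for _ in range(depth)])

    def forward(self, vis_emb, txt_emb, attn_mask):
        tokens = torch.cat([self.proj_v(vis_emb), self.proj_t(txt_emb)], dim=1)
        for layer in self.layers:
            tokens = layer(tokens, attn_mask)
        return tokens  # shared embeddings, consumable by segmentation/retrieval heads


# Hypothetical usage: 100 image tokens + 20 text tokens with full attention.
vis = torch.randn(2, 100, 768)   # e.g. frozen ViT patch embeddings
txt = torch.randn(2, 20, 512)    # e.g. frozen language-model token embeddings
model = FINDStyleInterface(vision_dim=768, text_dim=512)
mask = torch.zeros(120, 120, dtype=torch.bool)  # False = attention allowed
out = model(vis, txt, mask)      # (2, 120, 512) interleaved shared embeddings
```

Only the interface parameters are trainable in this sketch; a retrieval prototype would block cross-modal attention and pool per modality, while a segmentation prototype would let query tokens attend to image tokens, which is the mask-swapping idea the abstract refers to.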