Interfacing Foundation Models' Embeddings

Abstract

We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in the teaser figure, a lightweight transformer interface, with no tuning of any foundation model weights, is enough for unified image-level (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types. (3) Extendable. The proposed interface adapts to new tasks and new models. (4) Interleavable. With the benefit of multi-task, multi-modal training, the proposed interface creates an interleaved shared embedding space. Building on this interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code as well as the dataset have been released at https://github.com/UX-Decoder/FIND.
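
The "Prototypable" attribute, implementing tasks by prototyping attention masks and embedding types, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embedding type names (`image`, `text`, `query`), the two task specifications, and the helper `build_attention_mask` are hypothetical and are not taken from the FIND codebase. The sketch only shows the general idea that a task can be described as a set of permitted (attending type, attended type) pairs, from which a boolean mask is derived.

```python
from typing import Dict, List, Set, Tuple

# A task prototype is a set of (attending type, attended type) pairs.
# The type names below are hypothetical; the abstract does not enumerate them.
TaskSpec = Set[Tuple[str, str]]


def build_attention_mask(token_types: List[str], allowed: TaskSpec) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j."""
    return [[(src, dst) in allowed for dst in token_types] for src in token_types]


# Illustrative task prototypes (assumed, not the authors' configuration).
TASKS: Dict[str, TaskSpec] = {
    "segmentation": {("image", "image"), ("query", "image"), ("query", "query")},
    "retrieval": {("image", "image"), ("text", "text"),
                  ("query", "image"), ("query", "text")},
}

# One toy sequence: two image tokens, one text token, one query token.
token_types = ["image", "image", "text", "query"]
masks = {name: build_attention_mask(token_types, spec) for name, spec in TASKS.items()}

for name, mask in masks.items():
    print(name)
    for row in mask:
        print("  " + " ".join("1" if cell else "0" for cell in row))
```

Under these assumptions, switching tasks changes only the mask, not the architecture or weights, which matches the spirit of attributes (1) and (2). A row that is entirely `False` (for example the text token under the hypothetical segmentation prototype) would need special handling before a softmax in a real attention layer.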
