Interfacing Foundation Models' Embeddings

Abstract

We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in the teaser figure, a lightweight transformer interface, with no tuning of any foundation model weights, is enough for unified image-level (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types. (3) Extendable. The proposed interface adapts to new tasks and new models. (4) Interleavable. Benefiting from multi-task, multi-modal training, the proposed interface creates an interleaved shared embedding space. Building on this interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code, as well as the dataset, have been released at https://github.com/UX-Decoder/FIND.
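
To make the "prototypable" attribute more concrete, the following is a minimal sketch of how a task could be described by a set of embedding types plus an attention mask over them. It is an illustration under stated assumptions, not the FIND implementation: the segment names (`image_features`, `text_features`, `queries`), the token counts, the `TASK_PROTOTYPES` table, and the helper `build_attention_mask` are all hypothetical, and the attention rules chosen for each task are invented for the example.

```python
from itertools import product

# Hypothetical embedding types and token counts (illustrative only,
# not taken from the FIND code base). Order defines the token layout.
SEGMENTS = {
    "image_features": 4,  # tokens from a frozen vision backbone
    "text_features": 3,   # tokens from a frozen language model
    "queries": 2,         # learnable tokens owned by the interface
}

# Hypothetical task prototypes: each task is a set of
# (reader, source) pairs saying which embedding type may attend to which.
TASK_PROTOTYPES = {
    "segmentation": {
        ("queries", "image_features"),
        ("queries", "text_features"),
        ("queries", "queries"),
    },
    "retrieval": {
        ("queries", "image_features"),
        ("queries", "queries"),
    },
}


def segment_offsets(segments):
    """Map each embedding type to its [start, end) token range."""
    offsets, start = {}, 0
    for name, length in segments.items():
        offsets[name] = (start, start + length)
        start += length
    return offsets, start


def build_attention_mask(task, segments=SEGMENTS):
    """Return a square boolean mask where mask[i][j] is True
    when token i is allowed to attend to token j for this task."""
    offsets, total = segment_offsets(segments)
    mask = [[False] * total for _ in range(total)]
    for reader, source in TASK_PROTOTYPES[task]:
        r0, r1 = offsets[reader]
        s0, s1 = offsets[source]
        for i, j in product(range(r0, r1), range(s0, s1)):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    for task in TASK_PROTOTYPES:
        print(task)
        for row in build_attention_mask(task):
            print("".join("x" if allowed else "." for allowed in row))
```

In this toy layout, switching tasks changes only the mask and the set of participating embedding types, while the token layout and any weights behind it would stay the same. Rows for the frozen feature tokens are left empty here because the sketch only models which tokens the interface's queries read from.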
