Subobject-level Image Tokenization

Abstract

Transformer-based vision models typically tokenize images into fixed-size square patches as input units. This approach lacks adaptability to image content and overlooks the inherent pixel grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at the subobject level, where subobjects are represented by semantically meaningful image segments obtained from segmentation models (e.g., Segment Anything models). To implement a learning system based on subobject tokenization, we first introduce a Sequence-to-sequence AutoEncoder (SeqAE) that compresses subobject segments of varying sizes and shapes into compact embedding vectors, and then feed the subobject embeddings into a large language model for vision-language learning. Empirical results demonstrate that, compared to traditional patch-level tokenization, our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions. Code and models will be open-sourced at https://github.com/ChenDelong1999/subobjects.
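
To make the contrast between patch-level and subobject-level tokenization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a segment map is already available (in the paper this would come from a segmentation model; here `segment_ids` is a hand-written, hypothetical example), and it stops before the SeqAE step: it only shows that grouping pixels by segment yields variable-length tokens that follow content boundaries, whereas square patches have a fixed length and may mix regions. The function names `patch_tokens` and `subobject_tokens` are illustrative only.

```python
from collections import defaultdict


def patch_tokens(image, patch):
    """Split an image into fixed-size square patches (one pixel list per patch)."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, min(top + patch, h))
                           for c in range(left, min(left + patch, w))])
    return tokens


def subobject_tokens(image, segment_ids):
    """Group pixels by segment id; each token is a variable-length
    pixel sequence collected in raster order."""
    groups = defaultdict(list)
    for r, row in enumerate(segment_ids):
        for c, seg in enumerate(row):
            groups[seg].append(image[r][c])
    return [groups[seg] for seg in sorted(groups)]


# Toy 4x4 grayscale image with three uniform regions.
image = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [10, 50, 50, 200],
         [50, 50, 50, 50]]

# Hypothetical segment map, standing in for a segmentation model's output.
segment_ids = [[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 2, 2, 1],
               [2, 2, 2, 2]]

patches = patch_tokens(image, 2)
segments = subobject_tokens(image, segment_ids)

print([len(t) for t in patches])   # every patch token has the same length
print([len(t) for t in segments])  # segment tokens vary in length
```

In this toy case each segment token contains pixels from a single region, while some 2x2 patches straddle two regions. A sequence autoencoder such as the SeqAE described above would then be responsible for compressing each variable-length segment sequence into a fixed-size embedding.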
