

Subobject-level Image Tokenization

February 22, 2024
Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung
cs.AI

Abstract

Transformer-based vision models typically tokenize images into fixed-size square patches as input units, a design that lacks adaptability to image content and overlooks the inherent pixel-grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose image tokenization at the subobject level, where subobjects are semantically meaningful image segments obtained by segmentation models (e.g., the Segment Anything Model). To implement a learning system based on subobject tokenization, we first introduce a Sequence-to-sequence AutoEncoder (SeqAE) that compresses subobject segments of varying sizes and shapes into compact embedding vectors, and then feed these subobject embeddings into a large language model for vision-language learning. Empirical results demonstrate that, compared with traditional patch-level tokenization, our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions. Code and models will be open-sourced at https://github.com/ChenDelong1999/subobjects.
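
To make the described pipeline concrete, here is a minimal sketch, assuming PyTorch and the segment_anything package: SAM's automatic mask generator provides the subobject segments, and a toy Transformer encoder stands in for the paper's SeqAE. The function name tokenize_image and all architectural details (flattened pixel sequences, masked mean pooling, embedding size) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of subobject-level tokenization: segment an image into
# subobjects with SAM, encode each variable-size segment into one embedding
# with a toy SeqAE stand-in, and collect the resulting "subobject tokens".
# Requires: torch, numpy, segment_anything (and a downloaded SAM checkpoint).
import numpy as np
import torch
import torch.nn as nn
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


class SeqAE(nn.Module):
    """Toy stand-in for the paper's Sequence-to-sequence AutoEncoder:
    compresses a padded pixel sequence into a single embedding vector.
    (The real SeqAE is not released here; this flattens each segment to a
    pixel sequence, discarding its 2-D layout, as a simplification.)"""

    def __init__(self, in_dim: int = 3, embed_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def encode(self, pixels: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pixels: (B, L, 3) padded RGB sequences; pad_mask: (B, L), True = padding
        h = self.encoder(self.proj(pixels), src_key_padding_mask=pad_mask)
        valid = (~pad_mask).unsqueeze(-1).float()
        return (h * valid).sum(1) / valid.sum(1).clamp(min=1)  # masked mean pool


def tokenize_image(image: np.ndarray, seqae: SeqAE, max_len: int = 4096) -> torch.Tensor:
    """Turn an HxWx3 uint8 RGB image into a (num_subobjects, embed_dim) tensor
    of subobject token embeddings, ready to be fed to a language model."""
    # In practice the SAM model would be loaded once, not per call.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
    masks = SamAutomaticMaskGenerator(sam).generate(image)

    seqs, pads = [], []
    for m in masks:
        # m["segmentation"] is an (H, W) boolean mask; indexing yields (N, 3).
        seg = torch.from_numpy(image[m["segmentation"]]).float() / 255.0
        seg = seg[:max_len]  # truncate very large segments
        pad = torch.zeros(max_len, dtype=torch.bool)
        pad[len(seg):] = True  # mark padded positions
        seqs.append(torch.cat([seg, torch.zeros(max_len - len(seg), 3)]))
        pads.append(pad)

    return seqae.encode(torch.stack(seqs), torch.stack(pads))
```

In the full system the abstract describes, these per-segment embeddings would then be passed to a large language model for vision-language learning; the sketch stops at producing the subobject token embeddings.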