Subobject-level Image Tokenization
February 22, 2024
Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung
cs.AI
Abstract
Transformer-based vision models typically tokenize images into fixed-size
square patches as input units, an approach that lacks adaptability to image
content and overlooks the inherent pixel-grouping structure. Inspired by the
subword tokenization widely adopted in language models, we propose an image
tokenizer at a subobject level, where subobjects are represented by
semantically meaningful image segments obtained from segmentation models
(e.g., the Segment Anything Model). To implement a learning system based on
subobject tokenization, we first introduce a Sequence-to-sequence AutoEncoder
(SeqAE) that compresses subobject segments of varying sizes and shapes into
compact embedding vectors, and then feed the subobject embeddings into a
large language model for vision-language learning. Empirical results
demonstrate that our subobject-level tokenization significantly facilitates
efficient learning of translating images into object and attribute
descriptions compared to traditional patch-level tokenization. Code and
models will be open-sourced at https://github.com/ChenDelong1999/subobjects.
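
The pipeline the abstract describes can be illustrated with a short sketch. The following is a minimal, hypothetical Python example, not the authors' released implementation: it uses the real `segment_anything` package to produce subobject segments, but `SeqAEEncoder` is a toy encoder-only stand-in for the paper's SeqAE (the actual SeqAE is a sequence-to-sequence autoencoder trained with a reconstruction decoder), and all names, dimensions, and the checkpoint filename are illustrative assumptions.

```python
# Hypothetical sketch of subobject-level tokenization (not the authors' code).
# Assumes: pip install segment-anything torch numpy, plus a SAM checkpoint.

import numpy as np
import torch
import torch.nn as nn
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator


class SeqAEEncoder(nn.Module):
    """Toy stand-in for the paper's SeqAE encoder: maps a variable-length
    sequence of segment pixels to a single fixed-size embedding vector."""

    def __init__(self, embed_dim: int = 256, max_len: int = 1024):
        super().__init__()
        self.pixel_proj = nn.Linear(3, embed_dim)            # RGB -> embedding
        self.pos = nn.Parameter(torch.zeros(max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.max_len = max_len

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (N, 3) float tensor, the RGB values inside one segment
        x = self.pixel_proj(pixels[: self.max_len]).unsqueeze(0)
        x = x + self.pos[: x.shape[1]]
        h = self.encoder(x)
        return h.mean(dim=1).squeeze(0)                      # (embed_dim,)


def tokenize_image(image: np.ndarray, mask_generator, seqae) -> torch.Tensor:
    """Segment an HxWx3 uint8 image into subobjects and embed each one."""
    masks = mask_generator.generate(image)                   # list of mask dicts
    embeddings = []
    for m in masks:
        seg = m["segmentation"]                              # (H, W) bool mask
        pixels = torch.from_numpy(image[seg]).float() / 255.0  # (N, 3)
        embeddings.append(seqae(pixels))
    if not embeddings:                                       # no segments found
        return torch.empty(0, seqae.pos.shape[-1])
    return torch.stack(embeddings)                           # (num_subobjects, embed_dim)


if __name__ == "__main__":
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)
    seqae = SeqAEEncoder()
    image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)  # placeholder
    with torch.no_grad():
        tokens = tokenize_image(image, mask_generator, seqae)
    # In the paper's setting, these embeddings would then be projected into the
    # LLM's input space, playing the role that subword embeddings play in text.
    print(tokens.shape)
```

Note the contrast with patch tokenization: the number of tokens here varies with image content (one per segment) rather than being fixed by a grid, which is the adaptability the abstract argues for.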