Subobject-level Image Tokenization

Abstract

Transformer-based vision models typically tokenize images into fixed-size square patches as input units, an approach that does not adapt to image content and overlooks the inherent grouping structure of pixels. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer that operates at the subobject level, where subobjects are semantically meaningful image segments obtained from segmentation models (e.g., Segment Anything models). To build a learning system on subobject tokenization, we first introduce a Sequence-to-sequence AutoEncoder (SeqAE) that compresses subobject segments of varying sizes and shapes into compact embedding vectors, and then feed the subobject embeddings into a large language model for vision-language learning. Empirical results show that, compared with traditional patch-level tokenization, subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions. Code and models will be open-sourced at https://github.com/ChenDelong1999/subobjects.
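
The contrast between patch-level and subobject-level tokenization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patch_tokens` and `subobject_tokens` are hypothetical, and a hand-made label map stands in for the output of a real segmentation model. It only shows that patches yield fixed-length tokens on a regular grid, while segments yield variable-length pixel groups that a model such as SeqAE would then have to compress into fixed-size embeddings.

```python
import numpy as np


def patch_tokens(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch squares.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def subobject_tokens(image: np.ndarray, labels: np.ndarray) -> list:
    """Group pixels by segment id (hypothetical helper, for illustration).

    `labels` is an (H, W) integer map; in practice it would come from a
    segmentation model, which is assumed here rather than called.
    """
    tokens = []
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        ys, xs = np.nonzero(mask)
        tokens.append({
            "id": int(seg_id),
            # bounding box as (top, left, bottom, right), end-exclusive
            "bbox": (int(ys.min()), int(xs.min()),
                     int(ys.max()) + 1, int(xs.max()) + 1),
            # variable-length pixel sequence of shape (n_pixels, C)
            "pixels": image[mask],
        })
    return tokens


# Toy 8x8 RGB image and a toy label map with one rectangular "object".
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
labels = np.zeros((8, 8), dtype=np.int64)
labels[2:6, 3:8] = 1  # assumed segment; a real mask would be irregular

patches = patch_tokens(image, patch=4)
segments = subobject_tokens(image, labels)

print(patches.shape)                             # fixed-length tokens
print([s["pixels"].shape for s in segments])     # variable-length tokens
```

In this toy setup the patch grid cuts through the object region, whereas each segment token follows the region boundary, which is the adaptability the abstract refers to.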
