Adaptive Length Image Tokenization via Recurrent Allocation

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence, and even with large language models, which allocate varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object and part discovery.
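As a rough illustration of the recurrent allocation loop described above, the following Python sketch tracks how the 1D latent token set could grow across rollout iterations. It is a toy under stated assumptions, not the authors' implementation: the names `recurrent_allocation`, `toy_step`, and `zeros` are hypothetical, the step of 32 new tokens per iteration is an assumption (the abstract states only the 32 to 256 range), and `toy_step` is a placeholder for the learned encoder.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]  # toy stand-in for an embedding vector


def recurrent_allocation(
    image_tokens: Sequence[Token],
    step: Callable[[List[Token], List[Token]], Tuple[List[Token], List[Token]]],
    new_latents: Callable[[int], List[Token]],
    start: int = 32,     # smallest token count named in the abstract
    grow_by: int = 32,   # ASSUMPTION: growth per iteration is not stated in the abstract
    limit: int = 256,    # largest token count named in the abstract
) -> List[List[Token]]:
    """Return a snapshot of the 1D latent tokens after each rollout iteration."""
    tokens_2d = [list(t) for t in image_tokens]
    latents = new_latents(start)
    history: List[List[Token]] = []
    while True:
        # One iteration: refine the 2D tokens and update the existing 1D latents.
        tokens_2d, latents = step(tokens_2d, latents)
        history.append([list(t) for t in latents])
        if len(latents) >= limit:
            break
        # Increase capacity by appending fresh latent tokens for the next iteration.
        latents = latents + new_latents(min(grow_by, limit - len(latents)))
    return history


def toy_step(tokens_2d: List[Token], latents: List[Token]) -> Tuple[List[Token], List[Token]]:
    # Placeholder for the learned encoder (hypothetical): leaves the 2D tokens
    # unchanged and moves every latent halfway toward the mean image token.
    dim = len(tokens_2d[0])
    mean = [sum(t[d] for t in tokens_2d) / len(tokens_2d) for d in range(dim)]
    updated = [[0.5 * (lat[d] + mean[d]) for d in range(dim)] for lat in latents]
    return tokens_2d, updated


def zeros(n: int, dim: int = 4) -> List[Token]:
    # Hypothetical initializer for newly added latent tokens.
    return [[0.0] * dim for _ in range(n)]


image = [[1.0, 2.0, 3.0, 4.0] for _ in range(16)]  # e.g. a 4x4 grid of 2D tokens
history = recurrent_allocation(image, toy_step, zeros)
counts = [len(snapshot) for snapshot in history]
print(counts)  # [32, 64, 96, 128, 160, 192, 224, 256]
```

Each entry in `history` is a candidate representation of the same image at a different length. A variable-length tokenizer could then keep the shortest snapshot that meets some quality criterion, though that selection rule is likewise an assumption here and is not specified in the abstract.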
