画像は可変長の表現に値する

要旨

既存の視覚エンコーダの多くは、画像を固定長のトークン列にマッピングするが、異なる画像が異なる量の情報を含むという事実を見落としている。例えば、視覚的に複雑な画像（例：散らかった部屋）は、単純な画像（例：真っ白な壁）よりも本質的に多くの情報を有しており、それゆえより多くのトークンを割り当てる価値がある。この非効率性に対処するため、我々はDOVEを提案する。DOVEは、各画像を再構築するために可変数の視覚トークン（すなわち、連続的な表現ベクトル）を生成する動的視覚エンコーダである。我々の結果は、DOVEが高い再構築品質を維持しながら、平均トークン数を大幅に削減することを示している。いくつかの線形プロービングおよび下流のマルチモーダルタスクにおいて、DOVEは固定長エンコーディングと比較して、はるかに少ないトークンを使用しながら、既存のオートエンコーダベースのトークン化手法を上回り、より表現力豊かな意味的特徴を捉える。さらに、我々はDOVEをクエリ条件付きトークン化で拡張する。モデルにクエリ関連領域に焦点を当てるよう導くことで、より効率的でターゲットを絞った意味抽出を実現する。我々のコードとチェックポイントはhttps://dove-encoder.github.io/dove-encoderで公開されている。

English

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.

画像は可変長の表現に値する

Images are Worth Variable Length of Representations

要旨

Support