Images are Worth Variable Length of Representations
June 4, 2025
Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang
cs.AI
Abstract
Most existing vision encoders map images into a fixed-length sequence of
tokens, overlooking the fact that different images contain varying amounts of
information. For example, a visually complex image (e.g., a cluttered room)
inherently carries more information and thus deserves more tokens than a simple
image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a
dynamic vision encoder that produces a variable number of visual tokens (i.e.,
continuous representation vectors) to reconstruct each image. Our results show
that DOVE significantly reduces the average number of tokens while maintaining
high reconstruction quality. In several linear probing and downstream
multimodal tasks, it outperforms existing autoencoder-based tokenization
methods while using far fewer tokens, capturing more expressive semantic
features than fixed-length encoding. We further extend DOVE with
query-conditioned tokenization. Guiding the model to focus on query-relevant
regions yields more efficient and targeted semantic extraction. Our code
and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
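
The abstract does not say how the token count is chosen for each image, so the following is only an illustrative sketch of what variable-length output could look like, not DOVE's actual mechanism. It assumes a hypothetical encoder that emits a fixed maximum number of token vectors together with a per-position stop score, and it truncates the sequence at the first position whose score crosses a threshold. The names `truncate_tokens`, `stop_scores`, and `threshold` are invented for this example.

```python
from typing import List, Sequence

Token = List[float]  # one continuous representation vector


def truncate_tokens(
    tokens: Sequence[Token],
    stop_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[Token]:
    """Keep tokens up to and including the first position whose stop score
    reaches the threshold; keep every token if no position does.

    Assumption: the stop scores come from a hypothetical encoder head.
    """
    if len(tokens) != len(stop_scores):
        raise ValueError("tokens and stop_scores must have the same length")
    for i, score in enumerate(stop_scores):
        if score >= threshold:
            return list(tokens[: i + 1])
    return list(tokens)


def average_length(sequences: Sequence[Sequence[Token]]) -> float:
    """Mean number of tokens per image across a batch."""
    return sum(len(seq) for seq in sequences) / len(sequences)


# Toy data: the same 8 candidate tokens, with made-up stop scores for a
# simple image (stops early) and a complex image (stops late).
max_tokens = 8
tokens = [[float(i), float(i) * 0.5] for i in range(max_tokens)]

simple_image = truncate_tokens(tokens, [0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
complex_image = truncate_tokens(tokens, [0.1, 0.2, 0.1, 0.3, 0.2, 0.8, 0.0, 0.0])

print(len(simple_image), len(complex_image))          # 2 6
print(average_length([simple_image, complex_image]))  # 4.0
```

In this toy setup the simple image keeps 2 tokens and the complex one keeps 6, so the batch averages 4 tokens per image rather than the fixed 8.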