Images are Worth Variable Length of Representations

Abstract

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods while using far fewer tokens, and it captures more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, this extension achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
