Images are Worth Variable Length of Representations

Abstract

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. On several linear probing and downstream multimodal tasks, DOVE outperforms existing autoencoder-based tokenization methods while using far fewer tokens, and it captures more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization: by guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
