ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Abstract

A unified representation for text and vision is a natural goal, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level detail and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically lose fine detail. We present ViQ, a Visual Quantized Representations framework designed to balance semantics and detail in discrete representations while supporting inputs at their native resolution, so that it can serve as a unified, general-purpose discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. In text-aligned pre-training, we strengthen the visual encoder with semantically rich supervision from a pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy that progressively compacts the feature space, along with a position-aware head-wise quantization mechanism that handles arbitrary resolutions flexibly. Extensive experiments on multimodal tasks show that ViQ performs competitively with state-of-the-art multimodal vision encoders that use continuous, high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations substantially improves efficiency, yielding 20% to 70% acceleration across different base LLMs and training recipes.
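
To make the idea of head-wise quantization concrete, the following is a minimal sketch and not the ViQ implementation. It assumes that "head-wise" means splitting each token's feature vector into equal chunks, each with its own codebook, and mapping every chunk to its nearest codeword. The function names, the codebook layout, and the nearest-neighbor rule are all illustrative assumptions; the position-aware component and the proximal learning strategy described in the abstract are not modeled here.

```python
# Hypothetical sketch of head-wise vector quantization (not the paper's code).
# Assumption: each head owns an independent codebook over one chunk of the feature.

def split_heads(vec, num_heads):
    """Split a flat feature vector into num_heads equal-sized chunks."""
    if len(vec) % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    d = len(vec) // num_heads
    return [vec[h * d:(h + 1) * d] for h in range(num_heads)]

def nearest_code(chunk, codebook):
    """Return the index of the codeword closest to chunk (squared L2 distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(chunk, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def quantize(vec, codebooks):
    """Map one feature vector to one discrete index per head."""
    chunks = split_heads(vec, len(codebooks))
    return [nearest_code(c, cb) for c, cb in zip(chunks, codebooks)]

def dequantize(indices, codebooks):
    """Rebuild an approximate feature vector by concatenating the chosen codewords."""
    out = []
    for k, cb in zip(indices, codebooks):
        out.extend(cb[k])
    return out

# Toy setup: 2 heads, each with 2 codewords of dimension 2 (made-up values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
feature = [0.9, 1.2, 0.1, 0.8]
indices = quantize(feature, codebooks)
reconstruction = dequantize(indices, codebooks)
```

Because each token is quantized independently in this sketch, the same codebooks apply to any number of tokens, which is one simple way a quantizer can remain agnostic to input resolution.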