ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Abstract

A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework designed to balance semantics and details in discrete representations while supporting inputs at native resolution, so that it can serve as a unified, general-purpose discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. In text-aligned pre-training, we enhance the visual encoder with semantically rich supervision from a pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy that progressively compacts the feature space, along with a position-aware head-wise quantization mechanism that enables flexible processing of arbitrary resolutions. Extensive experiments on multimodal tasks demonstrate that ViQ achieves performance competitive with state-of-the-art multimodal vision encoders that use continuous, high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations substantially improves efficiency, yielding speedups of up to 20% to 70% across different base LLMs and training recipes.
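The abstract does not spell out how head-wise quantization works, so the following is only a generic illustration of the idea, not the ViQ implementation. It assumes that each patch feature is split into equal-sized heads and that each head is snapped to the nearest entry of its own codebook; the position-aware component is omitted. All names (`quantize_head_wise`, `quantize_patches`, the toy codebooks) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_head_wise(
    feature: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize one patch feature, one head at a time.

    Assumption (not from the paper): the feature is split into
    len(codebooks) contiguous chunks, and each chunk is replaced by
    its nearest code in the codebook belonging to that head.
    Returns the per-head code indices and the quantized feature.
    """
    num_heads = len(codebooks)
    if num_heads == 0 or len(feature) % num_heads != 0:
        raise ValueError("feature length must be divisible by the number of heads")
    head_dim = len(feature) // num_heads

    indices: List[int] = []
    quantized: List[float] = []
    for h, codebook in enumerate(codebooks):
        chunk = feature[h * head_dim:(h + 1) * head_dim]
        # Nearest-neighbour lookup within this head's codebook.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(chunk, codebook[k]),
        )
        indices.append(best)
        quantized.extend(codebook[best])
    return indices, quantized


def quantize_patches(
    features: Sequence[Vector], codebooks: Sequence[Sequence[Vector]]
) -> List[List[int]]:
    """Quantize any number of patch features into per-head code indices.

    The output length follows the number of patches, so an image at any
    resolution simply yields a longer or shorter index sequence.
    """
    return [quantize_head_wise(f, codebooks)[0] for f in features]


# Toy setup: 2 heads, each with a 2-entry codebook of 2-dimensional codes.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # head 0
    [[0.0, 1.0], [1.0, 0.0]],  # head 1
]
toy_indices, toy_quantized = quantize_head_wise([0.9, 1.2, 0.1, 0.8], toy_codebooks)
```

In this toy setup, each patch is described by one small index per head rather than a single index into one very large codebook, which is the usual motivation for splitting a feature into heads before quantizing.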