ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
October 14, 2025
Authors: Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, Wenhai Wang
cs.AI
Abstract
Existing Multimodal Large Language Models (MLLMs) suffer from increased
inference costs due to the additional vision tokens introduced by image inputs.
In this work, we propose Visual Consistency Learning (ViCO), a novel training
algorithm that enables the model to represent images of varying semantic
complexities using different numbers of vision tokens. The key idea behind our
method is to employ multiple MLP connectors, each with a different image
compression ratio, to downsample the vision tokens based on the semantic
complexity of the image. During training, we minimize the KL divergence between
the responses conditioned on different MLP connectors. At inference time, we
introduce an image router, termed Visual Resolution Router (ViR), that
automatically selects the appropriate compression rate for each image patch.
Compared with existing dynamic high-resolution strategies, which adjust the
number of visual tokens based on image resolutions, our method dynamically
adapts the number of visual tokens according to semantic complexity.
Experimental results demonstrate that our method can reduce the number of
vision tokens by up to 50% while maintaining the model's perception, reasoning,
and OCR capabilities. We hope this work will contribute to the development of
more efficient MLLMs. The code and models will be released to facilitate future
research.
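The consistency objective described above — minimizing the KL divergence between responses conditioned on an uncompressed and a compressed connector — can be sketched as a toy NumPy example. This is a minimal illustration, not the paper's implementation: `connector` (average pooling as a stand-in for the MLP connectors), `respond` (a toy pooling-and-projection head standing in for the MLLM), and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def connector(tokens, ratio):
    # Stand-in for an MLP connector with compression ratio `ratio`:
    # downsample vision tokens by averaging groups of `ratio` tokens.
    n, d = tokens.shape
    return tokens.reshape(n // ratio, ratio, d).mean(axis=1)

def kl(p, q, eps=1e-9):
    # KL(p || q) between two categorical distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def respond(vision_tokens, W):
    # Toy "model response": mean-pool the vision tokens, project to
    # logits over 10 classes, and normalize with softmax.
    return softmax(vision_tokens.mean(axis=0) @ W)

# Toy setup: 16 vision tokens of dimension 8, a random projection head.
tokens = rng.normal(size=(16, 8))
W = rng.normal(size=(8, 10))

p_full = respond(connector(tokens, 1), W)  # response with uncompressed tokens
p_comp = respond(connector(tokens, 4), W)  # response with 4x-compressed tokens
loss = kl(p_full, p_comp)                  # ViCO-style consistency loss
```

In training, this loss would encourage the compressed-connector response `p_comp` to match the uncompressed reference `p_full`, so that at inference time a router (ViR) can safely pick the higher compression rate for semantically simple patches.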