
ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

October 14, 2025
作者: Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, Wenhai Wang
cs.AI

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
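The abstract's two core mechanisms can be illustrated with a toy sketch: connectors with different compression ratios producing different numbers of vision tokens, a KL term pulling the compressed-token response toward the uncompressed-token response, and a per-patch router choosing a compression rate. Everything below is a minimal illustration, not the paper's implementation: the average-pooling connector, the linear response head `w`, and the variance-based complexity proxy in `visual_resolution_router` are all hypothetical stand-ins.

```python
import numpy as np

def mlp_connector(vision_feats, ratio):
    """Toy stand-in for an MLP connector: downsample vision tokens by
    average-pooling groups of `ratio` consecutive tokens."""
    n, d = vision_feats.shape
    n_out = n // ratio
    return vision_feats[: n_out * ratio].reshape(n_out, ratio, d).mean(axis=1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def consistency_loss(vision_feats, w, ratios=(1, 4)):
    """ViCO-style objective (sketch): minimize the KL divergence between the
    response distribution conditioned on compressed tokens and the one
    conditioned on uncompressed tokens. `w` is a toy linear head mapping
    pooled features to a 5-way "vocabulary" of logits."""
    ref_tokens = mlp_connector(vision_feats, ratios[0])
    p_ref = softmax(ref_tokens.mean(axis=0) @ w)      # reference response
    loss = 0.0
    for r in ratios[1:]:
        tokens = mlp_connector(vision_feats, r)       # fewer vision tokens
        q = softmax(tokens.mean(axis=0) @ w)
        loss += kl_divergence(p_ref, q)               # match the reference
    return loss

def visual_resolution_router(patch_feats, threshold=0.5):
    """ViR sketch: route a patch to the low-compression connector when a
    (hypothetical) complexity proxy -- here, feature variance -- is high,
    and to the high-compression connector otherwise."""
    return 1 if patch_feats.var() > threshold else 4

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 vision tokens, feature dim 8
w = rng.normal(size=(8, 5))        # toy response head
print(consistency_loss(feats, w))
print(visual_resolution_router(feats[:4]))
```

With `ratios=(1, 4)` the compressed branch uses a quarter of the tokens, mirroring the paper's goal of cutting vision tokens while keeping responses consistent; the real method trains the router and connectors jointly inside an MLLM rather than on pooled features.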