AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
December 3, 2025
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
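The decoupled advantage estimation described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: it assumes each rollout receives two scalar rewards (one for correct tool use, one for answer correctness), normalizes each reward within the rollout group in GRPO style, and then broadcasts each group-relative advantage only onto the tokens associated with its objective via 0/1 masks.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize rewards within a group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def decoupled_advantages(tool_rewards, answer_rewards, tool_mask, answer_mask):
    """Per-token advantages from two separately normalized objectives.

    tool_rewards / answer_rewards: one scalar per rollout.
    tool_mask / answer_mask: (num_rollouts, seq_len) 0/1 arrays marking
    which tokens belong to tool-call turns vs. final-answer turns.
    """
    a_tool = group_relative_advantage(tool_rewards)    # (num_rollouts,)
    a_ans = group_relative_advantage(answer_rewards)   # (num_rollouts,)
    # Each rollout-level advantage applies only to its own token subset,
    # so tool-use credit and answer-correctness credit stay decoupled.
    return a_tool[:, None] * tool_mask + a_ans[:, None] * answer_mask
```

In vanilla GRPO a single group-normalized reward would be spread over all tokens of a rollout; here tool-call tokens and answer tokens each receive credit only from their corresponding objective.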