AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
December 3, 2025
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
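The decoupled advantage estimation described above can be sketched in a few lines. This is an illustrative assumption of how DTPO-style per-objective advantages might be computed in a GRPO framework; the function names, reward definitions, and group setup below are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of DTPO-style decoupled advantage estimation.
# All names and reward definitions are illustrative assumptions,
# not the paper's actual implementation.
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def decoupled_advantages(tool_rewards, answer_rewards):
    """Compute separate advantages for tool-use tokens and answer tokens.

    tool_rewards:   per-rollout reward for correct/efficient tool invocation
    answer_rewards: per-rollout reward for answer correctness
    Returns one advantage value per rollout for each token group.
    """
    return (group_relative_advantage(tool_rewards),
            group_relative_advantage(answer_rewards))

# Toy group of 4 rollouts for one question:
tool_r = [1.0, 0.0, 1.0, 0.0]  # e.g. 1 if the bbox tool was invoked only when needed
ans_r = [1.0, 1.0, 0.0, 0.0]   # e.g. 1 if the final answer was correct
adv_tool, adv_ans = decoupled_advantages(tool_r, ans_r)
# In the policy-gradient loss, tool-invocation tokens of rollout i would be
# weighted by adv_tool[i] and answer tokens by adv_ans[i].
```

The key property of the decoupling is that a rollout with a wrong answer but a correct tool decision (or vice versa) still receives a positive advantage on the tokens that did their job, rather than a single mixed signal across the whole trajectory.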