AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. Existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, but they operate passively and cannot adapt to varying task requirements. This motivates a fundamental question: can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and, when necessary, selectively acquires additional visual information by invoking a bounding box tool to crop key regions. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for the tokens associated with each objective. This decoupling enables more effective optimization of AdaptVision than vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
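The decoupled advantage estimation described in the abstract can be illustrated with a rough sketch. The abstract does not give the reward definitions or the exact estimator, so everything below is an assumption: a GRPO-style group normalization is applied separately to a hypothetical tool-use reward and a hypothetical accuracy reward, and each token receives the advantage of the objective it is assumed to belong to. The dictionary keys (`tool_reward`, `acc_reward`, `is_tool_token`) and the function names are illustrative only and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_token_advantages(rollouts):
    """Give each token the advantage of the objective it belongs to.

    Each rollout is a dict with hypothetical keys (assumptions, not the
    paper's notation):
      "tool_reward":   float, reward for the tool-use decision
      "acc_reward":    float, reward for answer correctness
      "is_tool_token": list of bool, True for tokens in the tool-call turn
    """
    # Advantages are computed per objective, each over the same rollout group.
    tool_adv = group_normalize([r["tool_reward"] for r in rollouts])
    acc_adv = group_normalize([r["acc_reward"] for r in rollouts])
    # Tool-turn tokens get the tool advantage; all others get the accuracy one.
    return [
        [t if is_tool else a for is_tool in r["is_tool_token"]]
        for r, t, a in zip(rollouts, tool_adv, acc_adv)
    ]


# Two toy rollouts: the first used the tool well but answered wrongly,
# the second did the opposite.
rollouts = [
    {"tool_reward": 1.0, "acc_reward": 0.0, "is_tool_token": [True, False]},
    {"tool_reward": 0.0, "acc_reward": 1.0, "is_tool_token": [True, False]},
]
advantages = decoupled_token_advantages(rollouts)
```

By contrast, a vanilla GRPO setup would typically combine the rewards into one scalar per rollout and apply the resulting single advantage to every token, so in the toy group above the two rollouts would be indistinguishable if the rewards were simply summed.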

