VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

July 17, 2025
Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
cs.AI

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which often far exceeds the number of text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples at different resolutions, and present a new paradigm for visual token compression, namely VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving; if not, the model outputs a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
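
The paradigm described in the abstract amounts to a two-stage inference loop: answer from a downsampled image first, and rerun on the full-resolution image only when the model explicitly asks for it, with a reward that trades correctness against a penalty for requesting high resolution. The sketch below illustrates that flow under stated assumptions: the function names, the special-token string, the 1/2-per-side downsampling, the `model.generate` interface, and the toy reward weights are illustrative and are not taken from the released implementation.

```python
# Minimal sketch of a VisionThink-style inference loop and reward, based only on
# the abstract. The special token, downsampling factor, and model interface are
# assumptions for illustration.
from PIL import Image

REQUEST_HIGH_RES = "<request_high_res>"  # hypothetical special token


def downsample(image: Image.Image, factor: int = 2) -> Image.Image:
    """Resize to 1/factor per side, i.e. roughly 1/4 of the visual tokens for factor=2."""
    w, h = image.size
    return image.resize((max(1, w // factor), max(1, h // factor)))


def visionthink_answer(model, image: Image.Image, question: str) -> str:
    """Answer with the downsampled image first; escalate to full resolution
    only if the model emits the special request token."""
    reply = model.generate(image=downsample(image), prompt=question)
    if REQUEST_HIGH_RES in reply:
        # The model judged the low-resolution view insufficient
        # (e.g. for OCR-heavy questions), so rerun on the original image.
        reply = model.generate(image=image, prompt=question)
    return reply


def reward(judge_says_correct: bool, used_high_res: bool,
           resize_penalty: float = 0.1) -> float:
    """Toy reward: LLM-as-Judge correctness minus a small penalty for requesting
    the high-resolution image, discouraging unnecessary resize calls."""
    score = 1.0 if judge_says_correct else 0.0
    return score - (resize_penalty if used_high_res else 0.0)
```

The penalty term is the balancing lever the abstract alludes to: without it the policy can collapse into always (or never) requesting high resolution, whereas a small, tuned penalty keeps the resize call ratio stable.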