VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
July 17, 2025
Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
cs.AI
Abstract
Recent advancements in vision-language models (VLMs) have improved
performance by increasing the number of visual tokens, which are often
significantly longer than text tokens. However, we observe that most real-world
scenarios do not require such an extensive number of visual tokens. While the
performance drops significantly in a small subset of OCR-related tasks, models
still perform accurately in most other general VQA tasks with only 1/4
resolution. Therefore, we propose to dynamically process distinct samples with
different resolutions, and present a new paradigm for visual token compression,
namely, VisionThink. It starts with a downsampled image and smartly decides
whether it is sufficient for solving the problem. If not, the model outputs
a special token to request the higher-resolution image. Compared with existing
efficient VLM methods that compress tokens using fixed pruning ratios or
thresholds, VisionThink autonomously decides whether to compress tokens case by
case. As a result, it demonstrates strong fine-grained visual understanding
capability on OCR-related tasks, and meanwhile saves substantial visual tokens
on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge
strategy to successfully apply RL to general VQA tasks. Moreover, we carefully
design a reward function and penalty mechanism to achieve a stable and
reasonable image resize call ratio. Extensive experiments demonstrate the
superiority, efficiency, and effectiveness of our method. Our code is available
at https://github.com/dvlab-research/VisionThink.
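The two-stage decision loop described above can be sketched as follows. This is a minimal illustration based only on the abstract: the special-token spelling, the model interface, and the downsample factor are assumptions, not the paper's actual API.

```python
# Hypothetical sketch of VisionThink's inference-time decision loop.
# All names here (RESIZE_TOKEN, the model signature, the dict-based image)
# are illustrative assumptions.

RESIZE_TOKEN = "<request_high_res>"  # assumed special-token spelling

def downsample(image):
    """Return a copy at 1/2 side length, i.e. roughly 1/4 of the visual tokens."""
    return {"h": image["h"] // 2, "w": image["w"] // 2}

def vision_think(model, image, question):
    low_res = downsample(image)
    answer = model(low_res, question)   # first pass: cheap, downsampled input
    if RESIZE_TOKEN in answer:          # model judged low-res insufficient
        answer = model(image, question) # second pass: full-resolution image
    return answer

# Toy stand-in model: pretends fine-grained (OCR-like) questions
# need full resolution, while general questions do not.
def toy_model(image, question):
    if "read the text" in question and image["h"] < 512:
        return RESIZE_TOKEN
    return "answer"

full = {"h": 512, "w": 512}
print(vision_think(toy_model, full, "what color is the cat?"))
print(vision_think(toy_model, full, "read the text on the sign"))
```

In this sketch, the general VQA question is answered directly from the downsampled image, while the OCR-like question first emits the resize token and triggers a second, full-resolution pass, mirroring the case-by-case compression decision the abstract describes.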