VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Abstract

Recent advances in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which often far outnumber the text tokens. However, we observe that most real-world scenarios do not require so many visual tokens. Although performance drops significantly on a small subset of OCR-related tasks, models remain accurate on most other general VQA tasks at only 1/4 resolution. We therefore propose to process different samples at different resolutions dynamically, and present VisionThink, a new paradigm for visual token compression. VisionThink starts with a downsampled image and decides whether it is sufficient to solve the problem. If it is not, the model can output a special token to request the higher-resolution image. Unlike existing Efficient VLM methods, which compress tokens using fixed pruning ratios or thresholds, VisionThink decides autonomously, case by case, whether to compress tokens. As a result, it shows strong fine-grained visual understanding on OCR-related tasks while saving a substantial number of visual tokens on simpler tasks. We adopt reinforcement learning (RL) and propose the LLM-as-Judge strategy to apply RL successfully to general VQA tasks. In addition, we carefully design a reward function and a penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
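The two-stage inference flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker string `RESIZE_REQUEST`, the `Image` type, the patch-based token count, the reading of "1/4 resolution" as halving each side, and the toy model are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical marker: the abstract does not name the actual special token.
RESIZE_REQUEST = "<request_high_res>"


@dataclass(frozen=True)
class Image:
    """Stand-in for a real image; only its size matters in this sketch."""
    width: int
    height: int


def downsample(image: Image, factor: int = 2) -> Image:
    """Shrink each side by `factor` (factor=2 keeps 1/4 of the pixels)."""
    return Image(max(1, image.width // factor), max(1, image.height // factor))


def visual_tokens(image: Image, patch: int = 28) -> int:
    """Assumed cost model: one visual token per patch x patch block."""
    return max(1, image.width // patch) * max(1, image.height // patch)


# A model is any callable mapping (image, question) to a text reply.
Model = Callable[[Image, str], str]


def answer(model: Model, image: Image, question: str) -> tuple[str, int]:
    """Try the downsampled image first; fall back to full resolution
    only if the model asks for it. Returns (reply, visual tokens used)."""
    small = downsample(image)
    used = visual_tokens(small)
    reply = model(small, question)
    if RESIZE_REQUEST in reply:
        # The model judged the low-resolution view insufficient.
        used += visual_tokens(image)
        reply = model(image, question)
    return reply, used


def toy_model(image: Image, question: str) -> str:
    """Toy stand-in: requests more detail for 'read' questions on small images."""
    if "read" in question and image.width < 1000:
        return RESIZE_REQUEST
    return f"answer at {image.width}x{image.height}"
```

Under this cost model, a 1120x1120 image costs 400 visual tokens when the downsampled view suffices and 2000 when the model requests the full-resolution image, so the savings depend on how rarely the request is issued.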
