Reinforced Visual Perception with Tools

Abstract

Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these models for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but this approach faces key limitations: expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT, which enhances multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44%, respectively, on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
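For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage computation that standard GRPO is built around: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its group. This illustrates plain GRPO only, not the modified algorithm the abstract refers to. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the
    same prompt. The name and `eps` are hypothetical; population
    standard deviation is an assumption (implementations differ).
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses: two answered correctly (reward 1), two did not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.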
