Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
May 21, 2025
Authors: Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Chain-of-thought reasoning has significantly improved the performance of
Large Language Models (LLMs) across various domains. However, this reasoning
process has been confined exclusively to textual space, limiting its
effectiveness in visually intensive tasks. To address this limitation, we
introduce the concept of reasoning in the pixel-space. Within this novel
framework, Vision-Language Models (VLMs) are equipped with a suite of visual
reasoning operations, such as zoom-in and select-frame. These operations enable
VLMs to directly inspect, interrogate, and infer from visual evidence, thereby
enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space
reasoning capabilities in VLMs presents notable challenges, including the
model's initially imbalanced competence and its reluctance to adopt the newly
introduced pixel-space operations. We address these challenges through a
two-phase training approach. The first phase employs instruction tuning on
synthesized reasoning traces to familiarize the model with the novel visual
operations. Following this, a reinforcement learning (RL) phase leverages a
curiosity-driven reward scheme to balance exploration between pixel-space
reasoning and textual reasoning. With these visual operations, VLMs can
interact with complex visual inputs, such as information-rich images or videos,
to proactively gather necessary information. We demonstrate that this approach
significantly improves VLM performance across diverse visual reasoning
benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on
TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy
achieved by any open-source model to date. These results highlight the
importance of pixel-space reasoning and the effectiveness of our framework.