Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
May 21, 2025
Authors: Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in pixel space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather the necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
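To make the curiosity-driven reward scheme more concrete, here is a minimal sketch in Python of how such a reward might be computed over a batch of sampled reasoning traces: a correctness reward plus a bonus that is paid only while pixel-space operations are underused across the batch, and a light cost on excessive operations. All names and constants (`Rollout`, `target_rate`, `bonus`, `op_penalty`) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled reasoning trace for a query (hypothetical structure)."""
    is_correct: bool      # final answer matches the reference
    used_pixel_ops: bool  # trace invoked zoom-in / select-frame at least once
    num_pixel_ops: int    # number of visual operations issued


def curiosity_rewards(
    rollouts: List[Rollout],
    target_rate: float = 0.5,  # desired fraction of traces using pixel-space ops (assumed)
    bonus: float = 0.5,        # weight of the curiosity term (assumed)
    op_penalty: float = 0.1,   # mild per-operation cost to discourage overuse (assumed)
) -> List[float]:
    """Sketch of a curiosity-driven reward: reward correct answers, and add a
    bonus for invoking pixel-space operations only while the batch underuses
    them, so the model does not collapse back to purely textual reasoning."""
    usage_rate = sum(r.used_pixel_ops for r in rollouts) / max(len(rollouts), 1)
    rewards = []
    for r in rollouts:
        reward = 1.0 if r.is_correct else 0.0
        # Curiosity term: pay the bonus only when pixel-space exploration is rare.
        if usage_rate < target_rate and r.used_pixel_ops:
            reward += bonus
        # Light cost on extra operations keeps traces efficient.
        reward -= op_penalty * max(r.num_pixel_ops - 1, 0)
        rewards.append(reward)
    return rewards
```

In this sketch, the bonus vanishes once pixel-space operations are used often enough, so the curiosity term only counteracts the model's initial reluctance rather than rewarding visual operations unconditionally.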