Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
May 21, 2025
Authors: Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Chain-of-thought reasoning has significantly improved the performance of
Large Language Models (LLMs) across various domains. However, this reasoning
process has been confined exclusively to textual space, limiting its
effectiveness in visually intensive tasks. To address this limitation, we
introduce the concept of reasoning in the pixel-space. Within this novel
framework, Vision-Language Models (VLMs) are equipped with a suite of visual
reasoning operations, such as zoom-in and select-frame. These operations enable
VLMs to directly inspect, interrogate, and infer from visual evidence, thereby
enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space
reasoning capabilities in VLMs presents notable challenges, including the
model's initially imbalanced competence and its reluctance to adopt the newly
introduced pixel-space operations. We address these challenges through a
two-phase training approach. The first phase employs instruction tuning on
synthesized reasoning traces to familiarize the model with the novel visual
operations. Following this, a reinforcement learning (RL) phase leverages a
curiosity-driven reward scheme to balance exploration between pixel-space
reasoning and textual reasoning. With these visual operations, VLMs can
interact with complex visual inputs, such as information-rich images or videos,
to proactively gather necessary information. We demonstrate that this approach
significantly improves VLM performance across diverse visual reasoning
benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on
TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy
achieved by any open-source model to date. These results highlight the
importance of pixel-space reasoning and the effectiveness of our framework.