Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Abstract

Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, which limits its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in pixel space. Within this framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the new visual operations. The second phase uses reinforcement learning (RL) with a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
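To make the two visual operations named in the abstract concrete, the following is a minimal sketch of what zoom-in and select-frame could look like as plain functions. The abstract does not specify their signatures, so the function names `zoom_in` and `select_frame`, the `(left, top, right, bottom)` bounding-box convention, and the toy nested-list image type are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Toy grayscale image: a list of rows, each a list of pixel values.
# (Hypothetical representation chosen to keep the sketch dependency-free.)
Image = List[List[int]]


def zoom_in(image: Image, bbox: Tuple[int, int, int, int]) -> Image:
    """Return the crop of `image` inside bbox = (left, top, right, bottom).

    Coordinates are clamped to the image bounds; right and bottom are exclusive.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = bbox
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("bbox does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


def select_frame(video: Sequence[Image], index: int) -> Image:
    """Return a single frame of `video` (a sequence of images) by index."""
    if not 0 <= index < len(video):
        raise IndexError("frame index out of range")
    return video[index]


if __name__ == "__main__":
    # 4x4 image whose pixel at (row r, column c) holds the value r * 4 + c.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(zoom_in(image, (1, 1, 3, 3)))
    print(select_frame([image, [[0]]], 1))
```

In a setup like the one the abstract describes, a model would emit a call of this kind during reasoning and receive the cropped region or chosen frame back as new visual input. How the calls are encoded and how results are fed back are not stated in this excerpt.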
