PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Abstract

Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) on image-text pairs and their corresponding mask labels. Without an explicit reasoning process, however, they generalize poorly to out-of-distribution scenarios. Recent efforts use reinforcement learning via Group Relative Policy Optimization (GRPO) to enhance reasoning ability, but they often suffer from overthinking, producing uniformly verbose reasoning chains irrespective of task complexity. This raises computational cost and limits control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics that jointly assess segmentation accuracy, reasoning quality, and efficiency. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
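
The idea of regulating reasoning length by difficulty and uncertainty inside a GRPO-style loop can be illustrated with a minimal sketch. The abstract does not give PixelThink's actual reward, so everything below is an assumption made for illustration: the function names, the equal weighting of difficulty and uncertainty, the token budget range, and the penalty scale are all hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO recipe.

```python
from statistics import mean, pstdev


def token_budget(difficulty, uncertainty, min_tokens=32, max_tokens=256):
    """Hypothetical budget: harder scenes and less confident predictions get more tokens.

    Both inputs are assumed to lie in [0, 1]; the 50/50 weighting is illustrative.
    """
    need = 0.5 * difficulty + 0.5 * uncertainty
    return min_tokens + need * (max_tokens - min_tokens)


def length_aware_reward(seg_iou, num_tokens, difficulty, uncertainty, penalty=0.5):
    """Segmentation reward minus a capped penalty for exceeding the token budget."""
    budget = token_budget(difficulty, uncertainty)
    # Relative overflow past the budget; zero when the chain fits.
    overflow = max(0.0, num_tokens - budget) / budget
    return seg_iou - penalty * min(1.0, overflow)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses for one easy, confident query (difficulty = uncertainty = 0):
# identical mask quality, but increasingly long reasoning chains.
lengths = [30, 48, 200]
rewards = [length_aware_reward(0.8, n, difficulty=0.0, uncertainty=0.0) for n in lengths]
advantages = group_relative_advantages(rewards)
```

Under these assumed settings, the shortest chain receives the largest advantage because all three responses segment equally well, so a policy update would favor concise reasoning on easy samples while leaving more room on hard or uncertain ones.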
