PixelThink: Towards Efficient Chain-of-Pixel Reasoning
May 29, 2025
作者: Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, Xinchao Wang
cs.AI
Abstract
Existing reasoning segmentation approaches typically fine-tune multimodal
large language models (MLLMs) using image-text pairs and corresponding mask
labels. However, they exhibit limited generalization to out-of-distribution
scenarios without an explicit reasoning process. Although recent efforts
leverage reinforcement learning through group-relative policy optimization
(GRPO) to enhance reasoning ability, they often suffer from overthinking -
producing uniformly verbose reasoning chains irrespective of task complexity.
This results in elevated computational costs and limited control over reasoning
quality. To address this problem, we propose PixelThink, a simple yet effective
scheme that integrates externally estimated task difficulty and internally
measured model uncertainty to regulate reasoning generation within a
reinforcement learning paradigm. The model learns to compress reasoning length
in accordance with scene complexity and predictive confidence. To support
comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark
with annotated reasoning references and difficulty scores, along with a suite
of metrics designed to assess segmentation accuracy, reasoning quality, and
efficiency jointly. Experimental results demonstrate that the proposed approach
improves both reasoning efficiency and overall segmentation performance. Our
work contributes novel perspectives towards efficient and interpretable
multimodal understanding. The code and model will be publicly available.
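The abstract does not spell out how difficulty and uncertainty enter the reward, so the following is only a minimal sketch of the general idea: a GRPO-style group-relative advantage whose reward combines segmentation accuracy with a soft penalty for exceeding a reasoning-length budget scaled by the two signals. All names and schedules here (Sample, length_budget, alpha, the min/max token limits) are illustrative assumptions, not PixelThink's actual formulation.

```python
# Hedged sketch: length-aware reward shaping inside a GRPO-style update.
# Not the authors' implementation; all constants and names are assumptions.

from dataclasses import dataclass
import math


@dataclass
class Sample:
    iou: float             # segmentation quality of the predicted mask, in [0, 1]
    reasoning_tokens: int   # length of the generated reasoning chain
    difficulty: float       # externally estimated task difficulty, in [0, 1]
    uncertainty: float      # internally measured model uncertainty, in [0, 1]


def length_budget(difficulty: float, uncertainty: float,
                  min_len: int = 32, max_len: int = 256) -> float:
    """Grant a longer reasoning budget only for hard or uncertain cases (assumed schedule)."""
    scale = 0.5 * (difficulty + uncertainty)       # blend the external and internal signals
    return min_len + scale * (max_len - min_len)


def reward(sample: Sample, alpha: float = 0.5) -> float:
    """Accuracy reward minus a soft penalty for overshooting the length budget."""
    budget = length_budget(sample.difficulty, sample.uncertainty)
    overshoot = max(0.0, sample.reasoning_tokens - budget) / budget
    return sample.iou - alpha * overshoot


def grpo_advantages(group: list[Sample]) -> list[float]:
    """Group-relative advantages: standardize rewards within one sampled rollout group."""
    rewards = [reward(s) for s in group]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-6
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        Sample(iou=0.82, reasoning_tokens=64,  difficulty=0.2, uncertainty=0.1),
        Sample(iou=0.85, reasoning_tokens=240, difficulty=0.2, uncertainty=0.1),
        Sample(iou=0.60, reasoning_tokens=180, difficulty=0.8, uncertainty=0.7),
    ]
    print(grpo_advantages(group))  # concise yet accurate rollouts score highest
```

Under these assumptions, an easy, confidently predicted scene gets a small budget, so verbose rollouts are penalized, while hard or uncertain scenes retain room for longer reasoning, which matches the abstract's goal of compressing reasoning length according to scene complexity and predictive confidence.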