PixelThink: Towards Efficient Chain-of-Pixel Reasoning
May 29, 2025
作者: Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, Xinchao Wang
cs.AI
Abstract
Existing reasoning segmentation approaches typically fine-tune multimodal
large language models (MLLMs) using image-text pairs and corresponding mask
labels. However, they exhibit limited generalization to out-of-distribution
scenarios without an explicit reasoning process. Although recent efforts
leverage reinforcement learning through group-relative policy optimization
(GRPO) to enhance reasoning ability, they often suffer from overthinking -
producing uniformly verbose reasoning chains irrespective of task complexity.
This results in elevated computational costs and limited control over reasoning
quality. To address this problem, we propose PixelThink, a simple yet effective
scheme that integrates externally estimated task difficulty and internally
measured model uncertainty to regulate reasoning generation within a
reinforcement learning paradigm. The model learns to compress reasoning length
in accordance with scene complexity and predictive confidence. To support
comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark
with annotated reasoning references and difficulty scores, along with a suite
of metrics designed to assess segmentation accuracy, reasoning quality, and
efficiency jointly. Experimental results demonstrate that the proposed approach
improves both reasoning efficiency and overall segmentation performance. Our
work contributes novel perspectives towards efficient and interpretable
multimodal understanding. The code and model will be publicly available.
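The abstract does not spell out how difficulty and uncertainty enter the reward, so the following is only a minimal sketch of the general idea: a GRPO-style group-relative advantage whose reward combines segmentation accuracy with a soft penalty for exceeding a reasoning-length budget scaled by the two signals. All names and schedules here (Sample, length_budget, alpha, the min/max token limits) are illustrative assumptions, not PixelThink's actual formulation.

```python
# Hedged sketch: length-aware reward shaping inside a GRPO-style update.
# Not the authors' implementation; all constants and names are assumptions.

from dataclasses import dataclass
import math


@dataclass
class Sample:
    iou: float             # segmentation quality of the predicted mask, in [0, 1]
    reasoning_tokens: int   # length of the generated reasoning chain
    difficulty: float       # externally estimated task difficulty, in [0, 1]
    uncertainty: float      # internally measured model uncertainty, in [0, 1]


def length_budget(difficulty: float, uncertainty: float,
                  min_len: int = 32, max_len: int = 256) -> float:
    """Grant a longer reasoning budget only for hard or uncertain cases (assumed schedule)."""
    scale = 0.5 * (difficulty + uncertainty)       # blend the external and internal signals
    return min_len + scale * (max_len - min_len)


def reward(sample: Sample, alpha: float = 0.5) -> float:
    """Accuracy reward minus a soft penalty for overshooting the length budget."""
    budget = length_budget(sample.difficulty, sample.uncertainty)
    overshoot = max(0.0, sample.reasoning_tokens - budget) / budget
    return sample.iou - alpha * overshoot


def grpo_advantages(group: list[Sample]) -> list[float]:
    """Group-relative advantages: standardize rewards within one sampled rollout group."""
    rewards = [reward(s) for s in group]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-6
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        Sample(iou=0.82, reasoning_tokens=64,  difficulty=0.2, uncertainty=0.1),
        Sample(iou=0.85, reasoning_tokens=240, difficulty=0.2, uncertainty=0.1),
        Sample(iou=0.60, reasoning_tokens=180, difficulty=0.8, uncertainty=0.7),
    ]
    print(grpo_advantages(group))  # concise yet accurate rollouts score highest
```

Under these assumptions, an easy, confidently predicted scene gets a small budget, so verbose rollouts are penalized, while hard or uncertain scenes retain room for longer reasoning, which matches the abstract's goal of compressing reasoning length according to scene complexity and predictive confidence.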