

PixelThink: Towards Efficient Chain-of-Pixel Reasoning

May 29, 2025
作者: Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, Xinchao Wang
cs.AI

Abstract

Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through Group Relative Policy Optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking: producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
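The abstract describes a reward signal that modulates reasoning length using externally estimated task difficulty and internally measured model confidence inside a GRPO training loop. The paper's exact formulation is not reproduced here, so the following is a minimal sketch under assumed names (`shaped_reward`, `group_relative_advantages`) and an assumed functional form: a length penalty scaled so that easy, high-confidence samples are pushed toward shorter reasoning chains, combined with GRPO's standard within-group reward standardization.

```python
import statistics

def shaped_reward(
    seg_reward: float,    # task reward, e.g. mask IoU in [0, 1]
    reasoning_len: int,   # tokens in the generated reasoning chain
    difficulty: float,    # externally estimated task difficulty in [0, 1]
    confidence: float,    # internally measured predictive confidence in [0, 1]
    max_len: int = 512,   # reasoning-token budget (assumed)
    lam: float = 0.5,     # length-penalty weight (assumed)
) -> float:
    """Hypothetical reward: segmentation quality minus a length penalty
    that bites hardest on easy, high-confidence samples."""
    # Easy (low-difficulty) and confident predictions should be terse;
    # hard, uncertain ones are granted a longer reasoning budget.
    penalty_scale = (1.0 - difficulty) * confidence
    length_penalty = lam * penalty_scale * min(reasoning_len / max_len, 1.0)
    return seg_reward - length_penalty

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled rollouts for the same (easy, high-confidence) query.
rollouts = [
    dict(iou=0.82, length=120), dict(iou=0.80, length=480),
    dict(iou=0.55, length=300), dict(iou=0.70, length=60),
]
rewards = [
    shaped_reward(r["iou"], r["length"], difficulty=0.3, confidence=0.9)
    for r in rollouts
]
advantages = group_relative_advantages(rewards)
```

Under this kind of shaping, two rollouts with similar mask quality but different lengths receive different rewards on easy samples, so the group-relative advantage steers the policy toward the shorter chain, while hard or uncertain samples incur little penalty and retain room for longer reasoning.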
