PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Abstract

Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) on image-text pairs and their corresponding mask labels. Without an explicit reasoning process, however, these approaches generalize poorly to out-of-distribution scenarios. Recent efforts apply reinforcement learning through group-relative policy optimization (GRPO) to strengthen reasoning ability, but they often suffer from overthinking: they produce uniformly verbose reasoning chains irrespective of task complexity. This raises computational cost and limits control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress its reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics that jointly assess segmentation accuracy, reasoning quality, and efficiency. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
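
The abstract says that reasoning length is regulated by task difficulty and model uncertainty inside a reinforcement learning reward, but it does not give the reward itself. The following is only an illustrative sketch of one way such a length-aware reward could be shaped, not the authors' formulation. The function names, the linear token budget, the `min_tokens`/`max_tokens` bounds, and the `length_weight` coefficient are all hypothetical, and both difficulty and uncertainty are assumed to be normalized to [0, 1].

```python
def token_budget(difficulty, uncertainty, min_tokens=32, max_tokens=256):
    """Hypothetical reasoning budget (in tokens).

    Assumes difficulty and uncertainty are both normalized to [0, 1].
    The budget grows linearly with their mean, so easy, confident cases
    get a short budget and hard, uncertain cases get a long one.
    """
    for name, value in (("difficulty", difficulty), ("uncertainty", uncertainty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    demand = 0.5 * (difficulty + uncertainty)
    return min_tokens + demand * (max_tokens - min_tokens)


def length_reward(num_tokens, budget):
    """1.0 while within budget, decaying linearly to 0.0 at twice the budget."""
    if num_tokens <= budget:
        return 1.0
    overshoot = (num_tokens - budget) / budget
    return max(0.0, 1.0 - overshoot)


def total_reward(seg_reward, num_tokens, difficulty, uncertainty, length_weight=0.5):
    """Segmentation reward plus a weighted length term (assumed additive form)."""
    budget = token_budget(difficulty, uncertainty)
    return seg_reward + length_weight * length_reward(num_tokens, budget)


if __name__ == "__main__":
    # The same 200-token reasoning chain is penalized on an easy, confident
    # sample but not on a hard, uncertain one.
    print(total_reward(0.8, 200, difficulty=0.0, uncertainty=0.0))
    print(total_reward(0.8, 200, difficulty=1.0, uncertainty=1.0))
```

Under these assumptions, a policy trained with a group-relative method such as GRPO would be nudged towards shorter chains only where the sample is easy and the model is confident, which matches the behaviour the abstract describes at a high level.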