Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
August 6, 2025
Authors: Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
cs.AI
Abstract
Referring Expression Segmentation (RES) aims to segment image regions
specified by referring expressions and has gained increasing attention with the
rise of multimodal large language models (MLLMs). While MLLMs excel in semantic understanding,
their token-generation paradigm struggles with pixel-level dense prediction.
Existing RES methods either couple MLLMs with the parameter-heavy Segment
Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight
pipelines that sacrifice accuracy. To address this trade-off between performance
and cost, we propose MLLMSeg, a novel framework that fully
exploits the inherent visual detail features encoded in the MLLM vision encoder
without introducing an extra visual encoder. In addition, we propose a
detail-enhanced and semantic-consistent feature fusion (DSFF) module that fully
integrates the detail-related visual features with the semantic-related features
output by the large language model (LLM) of the MLLM. Finally, we establish a
lightweight mask decoder with only 34M network parameters that optimally
leverages detailed spatial features from the visual encoder and semantic
features from the LLM to achieve precise mask prediction. Extensive experiments
demonstrate that our method generally surpasses both SAM-based and SAM-free
competitors, striking a better balance between performance and cost. Code is
available at https://github.com/jcwang0602/MLLMSeg.
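
The abstract does not spell out the architecture in code, but the core idea (fuse detail-rich features from the MLLM's vision encoder with a semantic feature produced by its LLM, then decode a mask with a small convolutional head) can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the module names (DSFFSketch, LightMaskDecoderSketch), tensor shapes, and the concatenation-based fusion recipe are assumptions made for illustration, not the paper's released implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of the idea described in the abstract: fuse detail-rich
# features from the MLLM vision encoder with a semantic feature from the LLM,
# then decode a mask with a small convolutional decoder. Names, shapes, and the
# fusion recipe are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class DSFFSketch(nn.Module):
    """Detail-enhanced, semantic-consistent feature fusion (illustrative only)."""

    def __init__(self, vis_dim: int, llm_dim: int, fused_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, fused_dim, kernel_size=1)
        self.sem_proj = nn.Linear(llm_dim, fused_dim)
        self.mix = nn.Sequential(
            nn.Conv2d(2 * fused_dim, fused_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1),
        )

    def forward(self, vis_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C_v, H, W) detail features from the MLLM vision encoder.
        # sem_feat: (B, C_l) semantic feature from the LLM (e.g., a [SEG]-token state).
        v = self.vis_proj(vis_feat)                    # (B, D, H, W)
        s = self.sem_proj(sem_feat)[:, :, None, None]  # (B, D, 1, 1)
        s = s.expand_as(v)                             # broadcast over spatial dims
        return self.mix(torch.cat([v, s], dim=1))      # (B, D, H, W)


class LightMaskDecoderSketch(nn.Module):
    """Small convolutional decoder that upsamples fused features to mask logits."""

    def __init__(self, fused_dim: int = 256):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(fused_dim, fused_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(fused_dim // 2, fused_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(fused_dim // 4, 1, kernel_size=1),  # binary mask logits
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.up(fused)                             # (B, 1, 4H, 4W)


if __name__ == "__main__":
    # Toy shapes: a 24x24 feature map from the vision encoder and a 4096-d LLM state.
    vis = torch.randn(2, 1024, 24, 24)
    sem = torch.randn(2, 4096)
    fuse = DSFFSketch(vis_dim=1024, llm_dim=4096)
    decode = LightMaskDecoderSketch()
    mask_logits = decode(fuse(vis, sem))
    print(mask_logits.shape)  # torch.Size([2, 1, 96, 96])
```

The toy example at the bottom turns a 24x24 feature map into 96x96 mask logits; the actual MLLMSeg decoder design and its 34M parameter count are defined by the repository above, not by this sketch.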