
Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

August 6, 2025
Authors: Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
cs.AI

Abstract

Referring Expression Segmentation (RES) aims to segment the image region specified by a referring expression and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address this trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the visual detail features already encoded by the MLLM vision encoder, without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that deeply integrates the detail-related visual features with the semantic-related features output by the large language model (LLM) of the MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages the detailed spatial features from the visual encoder and the semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
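
The abstract only specifies the high-level data flow (vision-encoder detail features + LLM semantic features → DSFF fusion → light-weight mask decoder → mask logits); the internal design of DSFF and the decoder, along with all dimensions and module names below, are assumptions for illustration, not the paper's actual implementation. A minimal PyTorch-style sketch of that flow:

```python
import torch
import torch.nn as nn

class DSFF(nn.Module):
    """Illustrative detail-enhanced, semantic-consistent fusion:
    project both streams to a shared width, then fuse per patch."""
    def __init__(self, vis_dim, sem_dim, hidden_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.sem_proj = nn.Linear(sem_dim, hidden_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, vis_tokens, sem_token):
        # vis_tokens: (B, N, vis_dim) patch features from the MLLM vision encoder
        # sem_token:  (B, sem_dim)    expression embedding taken from the LLM output
        v = self.vis_proj(vis_tokens)                 # (B, N, H)
        s = self.sem_proj(sem_token).unsqueeze(1)     # (B, 1, H)
        s = s.expand(-1, v.shape[1], -1)              # broadcast semantics to every patch
        return self.fuse(torch.cat([v, s], dim=-1))   # (B, N, H)

class LightMaskDecoder(nn.Module):
    """Small convolutional decoder that upsamples fused patch tokens to mask logits."""
    def __init__(self, hidden_dim, patch_grid=24):
        super().__init__()
        self.patch_grid = patch_grid
        self.up = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, hidden_dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_dim // 2, hidden_dim // 4, 2, stride=2),
            nn.GELU(),
            nn.Conv2d(hidden_dim // 4, 1, 1),         # per-pixel mask logits
        )

    def forward(self, fused_tokens):
        B, N, H = fused_tokens.shape
        g = self.patch_grid
        x = fused_tokens.transpose(1, 2).reshape(B, H, g, g)  # tokens -> 2D feature map
        return self.up(x)                                      # (B, 1, 4g, 4g)

# Toy forward pass with random stand-ins for the vision-encoder and LLM outputs.
vis = torch.randn(2, 24 * 24, 1024)   # hypothetical patch features
sem = torch.randn(2, 4096)            # hypothetical LLM hidden state for the expression
fused = DSFF(1024, 4096, 256)(vis, sem)
mask_logits = LightMaskDecoder(256)(fused)
print(mask_logits.shape)              # torch.Size([2, 1, 96, 96])
```

The point of the sketch is the parameter budget: all trainable weights live in the small fusion and decoder modules, while the MLLM's existing vision encoder and LLM supply the features, which is how the paper keeps the mask head at roughly 34M parameters instead of attaching a 632M-parameter SAM.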