
SAMTok: Representing Any Mask with Two Words

January 22, 2026
Authors: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
cs.AI

Abstract

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multimodal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and a residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask-understanding and mask-generation samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene-graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
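To make the two-token mechanism concrete, the following is a minimal sketch of residual vector quantization mapping a mask embedding to exactly two discrete indices. All names, dimensions, and codebook sizes here (`MaskRVQ`, `embed_dim=256`, `codebook_size=8192`) are illustrative assumptions, not the released SAMTok implementation.

```python
# Hypothetical sketch: two-stage residual vector quantization of a mask
# embedding. The second codebook quantizes the residual left by the first,
# so every mask maps to exactly two discrete token ids.
import torch
import torch.nn as nn

class MaskRVQ(nn.Module):
    def __init__(self, embed_dim: int = 256, codebook_size: int = 8192):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, embed_dim) for _ in range(2)]
        )

    def quantize(self, z: torch.Tensor):
        """Map mask embeddings z of shape (B, D) to two ids per mask
        plus the quantized reconstruction."""
        ids = []
        recon = torch.zeros_like(z)
        residual = z
        for cb in self.codebooks:
            # Nearest codebook entry by L2 distance.
            dists = torch.cdist(residual, cb.weight)   # (B, K)
            idx = dists.argmin(dim=-1)                 # (B,)
            q = cb(idx)                                # (B, D)
            ids.append(idx)
            recon = recon + q
            residual = residual - q
        return torch.stack(ids, dim=-1), recon         # (B, 2), (B, D)
```

In such a scheme, the two indices would be rendered as special vocabulary tokens (e.g., a pair like `<mask_17><mask_4096>`) inside the MLLM's text stream, so mask generation reduces to ordinary next-token prediction and a frozen decoder can reconstruct the pixel mask afterward.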
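The textual answer-matching reward can likewise be pictured as a purely text-level comparison, which avoids decoding masks during RL rollouts. The token pattern and set-overlap scoring below are assumptions for exposition; the paper's exact reward definition may differ.

```python
# Illustrative sketch of a textual answer-matching reward for mask-generation
# RL: extract the two-id mask-token pairs from the predicted and reference
# answers and score their set overlap. Hypothetical token format.
import re

MASK_PAIR = re.compile(r"<mask_(\d+)><mask_(\d+)>")

def answer_match_reward(pred_text: str, ref_text: str) -> float:
    """Return 1.0 when predicted mask-token pairs exactly match the
    reference set, partial credit (set IoU) otherwise."""
    pred = set(MASK_PAIR.findall(pred_text))
    ref = set(MASK_PAIR.findall(ref_text))
    if not ref:
        return 0.0
    return len(pred & ref) / len(pred | ref)
```

Because the reward is computed directly on generated text, it fits standard policy-optimization pipelines (e.g., GRPO-style training) with no segmentation decoder in the reward loop.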