
SAMTok: Representing Any Mask with Two Words

January 22, 2026
Authors: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
cs.AI

Abstract

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from those tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss designs. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and a residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask-understanding and mask-generation samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
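The abstract does not spell out the tokenizer internals, but the "two tokens per mask" claim is consistent with a two-stage residual vector quantizer applied to a mask-encoder embedding: the first codebook quantizes the embedding, the second quantizes the leftover residual, and each stage emits one discrete index. Below is a minimal PyTorch sketch of that idea; the dimensions, codebook size, and all names are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class ResidualVectorQuantizer(nn.Module):
    """Two-stage residual VQ sketch: stage 1 quantizes the mask embedding,
    stage 2 quantizes what stage 1 missed. Each stage emits one code
    index, so every mask maps to exactly two discrete tokens.
    (Hypothetical reconstruction; sizes are illustrative.)"""

    def __init__(self, dim: int = 256, codebook_size: int = 8192, stages: int = 2):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(stages)
        )

    @torch.no_grad()
    def quantize(self, z: torch.Tensor) -> torch.Tensor:
        """Map mask embeddings (B, dim) to code indices (B, stages).
        Inference-style lookup; a trainable version would also need
        straight-through gradients and codebook losses."""
        indices, residual = [], z
        for book in self.codebooks:
            dists = torch.cdist(residual, book.weight)  # (B, K) L2 distances
            idx = dists.argmin(dim=-1)                  # nearest codebook entry
            indices.append(idx)
            residual = residual - book(idx)             # pass the residual on
        return torch.stack(indices, dim=-1)

    def dequantize(self, indices: torch.Tensor) -> torch.Tensor:
        """Sum the selected entries to approximate the original embedding."""
        return sum(book(indices[:, i]) for i, book in enumerate(self.codebooks))
```

In the paper's framing, the two indices would then be rendered as two special vocabulary tokens (e.g. something like `<mask_0_1234><mask_1_0567>`, a hypothetical format) that the MLLM emits through ordinary next-token prediction, with a SAM2-based decoder turning the dequantized embedding back into a pixel mask. The actual token format, codebook sizes, and decoder architecture are not specified in this abstract.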