Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
October 2, 2025
Authors: Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
cs.AI
Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent
years. However, existing approaches for vision tasks often rely on indirect
representations, such as generating coordinates as text for detection, which
limits performance and prevents dense prediction tasks like segmentation. To
overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a
unified paradigm that enables MLLMs to directly generate both textual and
diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs),
derived from visual patch embeddings of query images and interleaved seamlessly
with the LLM's output textual tokens. A lightweight decoder then transforms the
LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior
methods, PaDT processes VRTs independently at each forward pass and dynamically
expands the embedding table, thus improving localization and differentiation
among similar objects. We further tailor a training strategy for PaDT by
randomly selecting VRTs for supervised fine-tuning and introducing a robust
per-token cross-entropy loss. Our empirical studies across four visual
perception and understanding tasks show that PaDT consistently achieves
state-of-the-art performance, even when compared with significantly larger MLLM
models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
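The abstract describes the architecture only at a high level: VRT embeddings are projected from the query image's patch embeddings, appended to the embedding table independently at each forward pass, and the hidden states at VRT positions are decoded into dense predictions by a lightweight decoder. The following is a minimal PyTorch sketch of that flow, written under stated assumptions; all class, method, and parameter names (PaDTSketch, build_dynamic_vocab, decode_visual, box_head, mask_head) are hypothetical and are not taken from the official repository.

```python
# Illustrative sketch of the PaDT idea as described in the abstract; NOT the
# official implementation. Names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class PaDTSketch(nn.Module):
    def __init__(self, hidden_dim=1024, text_vocab_size=32000, num_patches=576):
        super().__init__()
        self.text_vocab_size = text_vocab_size
        self.text_embeddings = nn.Embedding(text_vocab_size, hidden_dim)
        # Projects ViT patch embeddings into the LLM token-embedding space,
        # yielding one candidate VRT embedding per image patch.
        self.vrt_proj = nn.Linear(hidden_dim, hidden_dim)
        # Lightweight decoder heads turning VRT hidden states into a box and
        # per-patch mask logits (stand-ins for detection/segmentation outputs).
        self.box_head = nn.Linear(hidden_dim, 4)
        self.mask_head = nn.Linear(hidden_dim, num_patches)

    def build_dynamic_vocab(self, patch_embeds):
        """patch_embeds: (num_patches, hidden_dim) from the query image.
        Returns a (text_vocab_size + num_patches, hidden_dim) embedding table
        that is rebuilt from scratch for every forward pass."""
        vrt_embeds = self.vrt_proj(patch_embeds)
        return torch.cat([self.text_embeddings.weight, vrt_embeds], dim=0)

    def decode_visual(self, hidden_states, token_ids):
        """hidden_states: (seq_len, hidden_dim); token_ids: (seq_len,).
        Selects positions where the LLM emitted a VRT id (>= text_vocab_size)
        and maps them to boxes and mask logits."""
        is_vrt = token_ids >= self.text_vocab_size
        vrt_hidden = hidden_states[is_vrt]              # (num_vrt, hidden_dim)
        boxes = self.box_head(vrt_hidden).sigmoid()     # normalized box coords
        mask_logits = self.mask_head(vrt_hidden)        # per-patch mask logits
        return boxes, mask_logits
```

Rebuilding the table per image ties each VRT id to the current image's patches rather than to a fixed learned vocabulary, which is what the abstract credits for improved localization and differentiation among similar objects.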
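The training recipe is stated only briefly (randomly selecting VRTs for supervised fine-tuning plus a robust per-token cross-entropy loss). One plausible reading, given below purely as an assumption and not as the paper's exact formulation, is to sample a few of the target object's VRT ids as labels and to count any VRT belonging to that object as correct when computing the token-level loss. The function names robust_vrt_ce and sample_target_vrts are hypothetical.

```python
# Hedged sketch of the training-side ideas from the abstract; one possible
# interpretation only, not the authors' loss.
import torch
import torch.nn.functional as F

def sample_target_vrts(object_vrt_ids, k=3, generator=None):
    """Randomly select up to k of the object's VRT ids to serve as the
    ground-truth visual tokens for this fine-tuning step."""
    perm = torch.randperm(len(object_vrt_ids), generator=generator)[:k]
    return object_vrt_ids[perm]

def robust_vrt_ce(logits, valid_vrt_ids):
    """logits: (vocab_size,) over the expanded vocabulary at one VRT position.
    valid_vrt_ids: LongTensor of token ids that all refer to the target object.
    Accepts any of the valid ids as correct by penalizing the negative log of
    the total probability mass assigned to them."""
    log_probs = F.log_softmax(logits, dim=-1)
    # logsumexp over log-probabilities == log of the summed probabilities.
    return -torch.logsumexp(log_probs[valid_vrt_ids], dim=-1)
```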