

REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

March 10, 2025
Authors: Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Meanwhile, current MLLMs that utilize latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions of visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units such as box, keypoint, depth, and mask. The combination of different visual prompts and visual units gives rise to a wide variety of task types, significantly expanding the applicability of REF-VLM. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.
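As a rough illustration of the triplet structure described in the abstract, the sketch below parses a hypothetical TRP-style model response into (concept, decoding type, target) triplets. The delimiter tokens (`<trp>`, `<cpt>`, `<dec>`, `<tgt>`) and the `[UNIT_i]` placeholders are assumptions made for this example only; the actual symbolic delimiters and output format are defined in the REF-VLM codebase linked above.

```python
import re
from dataclasses import dataclass

# Hypothetical delimiter tokens; the real symbolic delimiters used by TRP
# are defined in the REF-VLM repository and may differ.
TRIPLET_PATTERN = re.compile(
    r"<trp>\s*<cpt>(?P<concept>.+?)</cpt>\s*"
    r"<dec>(?P<decode_type>.+?)</dec>\s*"
    r"<tgt>(?P<targets>.+?)</tgt>\s*</trp>",
    re.DOTALL,
)

@dataclass
class Triplet:
    concept: str        # e.g. "person"
    decode_type: str    # e.g. "mask", "box", "keypoint", "depth"
    targets: list[str]  # references to the visual units to be decoded

def parse_trp_output(text: str) -> list[Triplet]:
    """Extract (concept, decoding type, target) triplets from a model response."""
    triplets = []
    for m in TRIPLET_PATTERN.finditer(text):
        triplets.append(
            Triplet(
                concept=m.group("concept").strip(),
                decode_type=m.group("decode_type").strip(),
                targets=[t.strip() for t in m.group("targets").split(",")],
            )
        )
    return triplets

# Example with made-up delimiters and unit references:
response = (
    "Two people are playing frisbee. "
    "<trp><cpt>person</cpt><dec>mask</dec><tgt>[UNIT_0], [UNIT_1]</tgt></trp> "
    "<trp><cpt>frisbee</cpt><dec>box</dec><tgt>[UNIT_2]</tgt></trp>"
)
for t in parse_trp_output(response):
    print(t)
```

The point of the structured output is that each decoding request in the response can be parsed deterministically and routed to the appropriate visual decoder (mask, box, keypoint, or depth head), rather than being recovered from free-form text.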
