Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
December 19, 2023
Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi
cs.AI
Abstract
The ability of large language models (LLMs) to process visual inputs has
given rise to general-purpose vision systems, unifying various vision-language
(VL) tasks by instruction tuning. However, due to the enormous diversity in
input-output formats in the vision domain, existing general-purpose models fail
to successfully integrate segmentation and multi-image inputs with coarse-level
tasks into a single framework. In this work, we introduce VistaLLM, a powerful
visual system that addresses coarse- and fine-grained VL tasks over single and
multiple input images using a unified framework. VistaLLM utilizes an
instruction-guided image tokenizer that filters global embeddings using task
descriptions to extract compressed and refined features from numerous images.
Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to
represent binary segmentation masks as sequences, significantly improving over
previously used uniform sampling. To bolster the desired capability of
VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning
dataset with 6.8M samples. We also address the lack of multi-image grounding
datasets by introducing a novel task, AttCoSeg (Attribute-level
Co-Segmentation), which boosts the model's reasoning and grounding capability
over multiple input images. Extensive experiments on a wide range of V- and VL
tasks demonstrate the effectiveness of VistaLLM by achieving consistent
state-of-the-art performance over strong baselines across all downstream tasks.
Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
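
The abstract only outlines the gradient-aware adaptive sampling idea; the exact procedure is specified in the paper, not here. As a minimal illustration of how a binary mask can be serialized with more points placed where the contour bends sharply (rather than at uniform arc-length intervals), the sketch below uses NumPy and OpenCV with hypothetical names such as `adaptive_contour_points` and `num_points`. It approximates "gradient" by the local change in contour direction and should be read as an assumption-laden sketch, not the authors' implementation.

```python
import numpy as np
import cv2


def adaptive_contour_points(mask: np.ndarray, num_points: int = 32) -> np.ndarray:
    """Serialize a binary mask as a fixed-length (x, y) point sequence.

    Points are allocated along the largest outer contour in proportion to the
    local change in direction, so corners and curved regions receive more
    samples than straight segments (uniform sampling would space them evenly).
    """
    # Extract the largest outer contour of the mask.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    contour = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float32)  # (N, 2)

    # Direction change between consecutive contour segments approximates curvature.
    diffs = np.diff(contour, axis=0, append=contour[:1])
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])
    curvature = np.abs(np.diff(angles, append=angles[:1]))
    curvature = np.minimum(curvature, 2 * np.pi - curvature)  # handle angle wrap-around

    # Build an importance CDF; a small floor keeps straight runs from being skipped.
    weights = curvature + 1e-3
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]

    # Pick contour indices at evenly spaced quantiles of the importance CDF.
    targets = (np.arange(num_points) + 0.5) / num_points
    idx = np.clip(np.searchsorted(cdf, targets), 0, len(contour) - 1)
    return contour[idx]  # (num_points, 2) sequence usable as output tokens
```

Under this sketch, the resulting point sequence could be flattened into coordinate tokens for a sequence-to-sequence model, which is the general mask-as-sequence setup the abstract describes; how VistaLLM actually tokenizes and decodes these points is detailed in the paper.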