Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
December 19, 2023
Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi
cs.AI
Abstract
The ability of large language models (LLMs) to process visual inputs has
given rise to general-purpose vision systems, unifying various vision-language
(VL) tasks by instruction tuning. However, due to the enormous diversity in
input-output formats in the vision domain, existing general-purpose models fail
to successfully integrate segmentation and multi-image inputs with coarse-level
tasks into a single framework. In this work, we introduce VistaLLM, a powerful
visual system that addresses coarse- and fine-grained VL tasks over single and
multiple input images using a unified framework. VistaLLM utilizes an
instruction-guided image tokenizer that filters global embeddings using task
descriptions to extract compressed and refined features from numerous images.
Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to
represent binary segmentation masks as sequences, significantly improving over
previously used uniform sampling. To bolster the desired capability of
VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning
dataset with 6.8M samples. We also address the lack of multi-image grounding
datasets by introducing a novel task, AttCoSeg (Attribute-level
Co-Segmentation), which boosts the model's reasoning and grounding capability
over multiple input images. Extensive experiments on a wide range of vision-only (V) and VL
tasks demonstrate the effectiveness of VistaLLM by achieving consistent
state-of-the-art performance over strong baselines across all downstream tasks.
Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
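The instruction-guided image tokenizer described above can be read as a query-based resampler whose queries are conditioned on the task description. Below is a minimal PyTorch sketch of that idea, not the paper's implementation: the class name, the dimensions, the single pooled text embedding, and the additive query conditioning are all assumptions made for illustration.

import torch
import torch.nn as nn

class InstructionGuidedTokenizer(nn.Module):
    """Compress many patch embeddings into a few task-conditioned tokens."""
    def __init__(self, d_model: int = 512, n_query: int = 32, n_heads: int = 8):
        super().__init__()
        # Learnable queries, shared across images, biased by the instruction.
        self.queries = nn.Parameter(torch.randn(n_query, d_model) * 0.02)
        self.text_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, d_model) image features;
        # text: (B, d_model) pooled task-description embedding.
        b = patches.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1) + self.text_proj(text).unsqueeze(1)
        # Cross-attention lets the instruction decide which patches matter,
        # yielding a compressed, task-relevant token set per image.
        tokens, _ = self.attn(q, patches, patches)
        return tokens  # (B, n_query, d_model)

tokenizer = InstructionGuidedTokenizer()
patches = torch.randn(2, 196, 512)  # e.g., a 14x14 ViT patch grid per image
instruction = torch.randn(2, 512)   # pooled embedding of the task description
print(tokenizer(patches, instruction).shape)  # torch.Size([2, 32, 512])

Because each image is reduced to n_query tokens regardless of its resolution, the same interface scales from a single image to the multi-image inputs the paper targets.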
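Similarly, gradient-aware adaptive sampling of mask contours can be sketched in a few lines of NumPy. This is a hedged approximation, not the paper's algorithm: it treats the local turning angle of the contour as the "gradient" signal, smooths it over a small window, and inverts the resulting weight CDF so that sharply bending regions receive more of the k sample points than flat ones (uniform sampling would instead space points evenly by position along the contour).

import numpy as np

def adaptive_sample(contour: np.ndarray, k: int = 32) -> np.ndarray:
    """Pick k points from an ordered (N, 2) contour, concentrating samples
    where the boundary bends sharply instead of spacing them uniformly."""
    # Direction of each edge, then the turning angle between adjacent edges.
    diffs = np.diff(contour, axis=0, append=contour[:1])
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])
    turn = np.abs(np.diff(angles, append=angles[:1]))
    turn = np.minimum(turn, 2 * np.pi - turn)  # handle angle wrap-around
    # A uniform floor keeps flat regions represented; smoothing spreads the
    # mass of a single sharp corner over its neighbors.
    weights = np.convolve(turn + 1e-2, np.ones(5) / 5, mode="same")
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    # Invert the CDF at k evenly spaced quantiles: high-curvature spans
    # occupy more CDF mass and therefore attract more samples.
    quantiles = (np.arange(k) + 0.5) / k
    idx = np.searchsorted(cdf, quantiles)
    return contour[np.clip(idx, 0, len(contour) - 1)]

# Usage: a unit square traced counterclockwise; samples cluster at corners.
t = np.linspace(0, 1, 100, endpoint=False)
square = np.concatenate([
    np.stack([t, np.zeros_like(t)], axis=1),      # bottom edge
    np.stack([np.ones_like(t), t], axis=1),       # right edge
    np.stack([1 - t, np.ones_like(t)], axis=1),   # top edge
    np.stack([np.zeros_like(t), 1 - t], axis=1),  # left edge
])
print(adaptive_sample(square, k=16).round(2))

A fixed-length point sequence like this is what allows a language model to emit binary segmentation masks as ordinary token sequences, which is the motivation the abstract gives for improving on uniform sampling.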