Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Abstract

The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems that unify various vision-language (VL) tasks through instruction tuning. However, because input-output formats in the vision domain are enormously diverse, existing general-purpose models fail to integrate segmentation and multi-image inputs with coarse-level tasks in a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM uses an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
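
To make the contrast between uniform and adaptive sampling of a mask contour concrete, the following is a minimal sketch and not the authors' implementation. It assumes the mask boundary is already available as an ordered, closed list of points, uses the magnitude of the discrete second difference as a stand-in for the "gradient" signal, and selects points by inverting the cumulative weight distribution. The function names (`rectangle_contour`, `uniform_sample`, `adaptive_sample`) and the `eps` parameter are hypothetical and chosen only for illustration.

```python
import bisect
import math
from itertools import accumulate

Point = tuple[float, float]


def rectangle_contour(w: int, h: int) -> list[Point]:
    """Toy closed contour: the boundary of a w x h rectangle, one point per unit step."""
    bottom = [(float(x), 0.0) for x in range(w)]            # (0,0) .. (w-1,0)
    right = [(float(w), float(y)) for y in range(h)]        # (w,0) .. (w,h-1)
    top = [(float(x), float(h)) for x in range(w, 0, -1)]   # (w,h) .. (1,h)
    left = [(0.0, float(y)) for y in range(h, 0, -1)]       # (0,h) .. (0,1)
    return bottom + right + top + left


def uniform_sample(contour: list[Point], k: int) -> list[Point]:
    """Baseline: k points evenly spaced by index along the contour."""
    n = len(contour)
    return [contour[(i * n) // k] for i in range(k)]


def adaptive_sample(contour: list[Point], k: int, eps: float = 1e-3) -> list[Point]:
    """Assumed scheme: favor points where the contour direction changes sharply.

    May return fewer than k points when several quantiles map to the same index.
    """
    n = len(contour)
    weights = []
    for i, (x, y) in enumerate(contour):
        px, py = contour[i - 1]          # index -1 wraps around: the contour is closed
        nx, ny = contour[(i + 1) % n]
        # Discrete second difference: zero on straight runs, large at sharp bends.
        bend = math.hypot(nx - 2 * x + px, ny - 2 * y + py)
        weights.append(bend + eps)       # eps keeps straight runs selectable
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    picked: list[int] = []
    for j in range(k):
        target = (j + 0.5) / k * total   # evenly spaced quantiles of total weight
        i = min(bisect.bisect_left(cumulative, target), n - 1)
        if i not in picked:
            picked.append(i)
    return [contour[i] for i in sorted(picked)]
```

On the toy contour `rectangle_contour(12, 4)` with `k = 4`, `uniform_sample` returns two corners and two mid-edge points, while `adaptive_sample` returns the four corners, which are the only points needed to reconstruct the rectangle. Real mask contours are noisier, so a practical version would likely smooth the contour before measuring how sharply it bends.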
