Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
December 19, 2023
Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi
cs.AI
Abstract
The ability of large language models (LLMs) to process visual inputs has
given rise to general-purpose vision systems, unifying various vision-language
(VL) tasks by instruction tuning. However, due to the enormous diversity in
input-output formats in the vision domain, existing general-purpose models fail
to successfully integrate segmentation and multi-image inputs with coarse-level
tasks into a single framework. In this work, we introduce VistaLLM, a powerful
visual system that addresses coarse- and fine-grained VL tasks over single and
multiple input images using a unified framework. VistaLLM utilizes an
instruction-guided image tokenizer that filters global embeddings using task
descriptions to extract compressed and refined features from numerous images.
Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to
represent binary segmentation masks as sequences, significantly improving over
previously used uniform sampling. To bolster the desired capability of
VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning
dataset with 6.8M samples. We also address the lack of multi-image grounding
datasets by introducing a novel task, AttCoSeg (Attribute-level
Co-Segmentation), which boosts the model's reasoning and grounding capability
over multiple input images. Extensive experiments on a wide range of V- and VL
tasks demonstrate the effectiveness of VistaLLM by achieving consistent
state-of-the-art performance over strong baselines across all downstream tasks.
Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
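
The abstract only outlines the gradient-aware adaptive sampling idea; the exact procedure is specified in the paper, not here. As a minimal illustration of how a binary mask can be serialized with more points placed where the contour bends sharply (rather than at uniform arc-length intervals), the sketch below uses NumPy and OpenCV with hypothetical names such as `adaptive_contour_points` and `num_points`. It approximates "gradient" by the local change in contour direction and should be read as an assumption-laden sketch, not the authors' implementation.

```python
import numpy as np
import cv2


def adaptive_contour_points(mask: np.ndarray, num_points: int = 32) -> np.ndarray:
    """Serialize a binary mask as a fixed-length (x, y) point sequence.

    Points are allocated along the largest outer contour in proportion to the
    local change in direction, so corners and curved regions receive more
    samples than straight segments (uniform sampling would space them evenly).
    """
    # Extract the largest outer contour of the mask.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    contour = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float32)  # (N, 2)

    # Direction change between consecutive contour segments approximates curvature.
    diffs = np.diff(contour, axis=0, append=contour[:1])
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])
    curvature = np.abs(np.diff(angles, append=angles[:1]))
    curvature = np.minimum(curvature, 2 * np.pi - curvature)  # handle angle wrap-around

    # Build an importance CDF; a small floor keeps straight runs from being skipped.
    weights = curvature + 1e-3
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]

    # Pick contour indices at evenly spaced quantiles of the importance CDF.
    targets = (np.arange(num_points) + 0.5) / num_points
    idx = np.clip(np.searchsorted(cdf, targets), 0, len(contour) - 1)
    return contour[idx]  # (num_points, 2) sequence usable as output tokens
```

Under this sketch, the resulting point sequence could be flattened into coordinate tokens for a sequence-to-sequence model, which is the general mask-as-sequence setup the abstract describes; how VistaLLM actually tokenizes and decodes these points is detailed in the paper.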