Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Abstract

The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems that unify various vision-language (VL) tasks through instruction tuning. However, because input-output formats in the vision domain are enormously diverse, existing general-purpose models fail to integrate segmentation and multi-image inputs with coarse-level tasks in a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V and VL tasks demonstrate the effectiveness of VistaLLM, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
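
To give some intuition for why adaptive sampling can represent a mask contour better than uniform sampling, the following is a minimal illustrative sketch and not VistaLLM's implementation. The abstract does not specify the algorithm, so the function names `uniform_sample` and `adaptive_sample`, the `eps` parameter, and the use of the turning angle between consecutive contour edges as the "gradient" signal are all assumptions made for this example.

```python
import numpy as np

def uniform_sample(contour, n):
    """Pick n points evenly spaced by index along a closed contour."""
    idx = np.linspace(0, len(contour), n, endpoint=False).astype(int)
    return contour[idx]

def adaptive_sample(contour, n, eps=1e-3):
    """Pick up to n points, placed where the contour direction changes most.

    Assumption for this sketch: the change in edge direction (turning angle)
    stands in for the gradient signal mentioned in the abstract.
    """
    pts = np.asarray(contour, dtype=float)
    # Edge entering and edge leaving each point of the closed contour.
    incoming = pts - np.roll(pts, 1, axis=0)
    outgoing = np.roll(pts, -1, axis=0) - pts
    ang_in = np.arctan2(incoming[:, 1], incoming[:, 0])
    ang_out = np.arctan2(outgoing[:, 1], outgoing[:, 0])
    # Turning angle wrapped to [-pi, pi); its magnitude is the sampling weight.
    turn = np.abs((ang_out - ang_in + np.pi) % (2 * np.pi) - np.pi)
    weights = turn + eps  # eps keeps straight segments from having zero weight
    cdf = np.cumsum(weights) / weights.sum()
    # Equally spaced quantiles of the cumulative weight select the samples.
    targets = (np.arange(n) + 0.5) / n
    idx = np.searchsorted(cdf, targets)
    idx = np.unique(np.clip(idx, 0, len(pts) - 1))
    return pts[idx]

# Toy closed contour: a 12 x 4 rectangle traced point by point (32 points).
rect = np.array(
    [(x, 0) for x in range(12)]
    + [(12, y) for y in range(4)]
    + [(12 - x, 4) for x in range(12)]
    + [(0, 4 - y) for y in range(4)]
)

print(uniform_sample(rect, 4))   # evenly spaced indices, misses two corners
print(adaptive_sample(rect, 4))  # lands on the four corners of the rectangle
```

With a budget of four points on this toy rectangle, index-uniform sampling places two points mid-edge, while the weighted variant recovers all four corners, so the sampled sequence reconstructs the shape exactly. Real masks would first need a contour extracted from the binary mask, which this sketch leaves out.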
