

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

December 21, 2023
Authors: Jitesh Jain, Jianwei Yang, Humphrey Shi
cs.AI

Abstract

Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly, we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs, including GPT-4V. To promote research, we open-source our dataset, code, and models at https://github.com/SHI-Labs/VCoder
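As a rough illustration of the adapter idea described in the abstract (a minimal sketch, not the authors' exact implementation), the snippet below shows how features from an extra "perception" encoder run on a segmentation or depth map might be projected into the LLM's token space and concatenated with the usual image tokens of a LLaVA-style MLLM. The module name `PerceptionAdapter`, the tensor shapes, and the MLP projector design are assumptions made for the example.

```python
# Illustrative sketch only (not the official VCoder code): fuse tokens from an
# extra perception encoder (e.g., run on a rendered segmentation map) with the
# standard image tokens before they reach the language model.
import torch
import torch.nn as nn


class PerceptionAdapter(nn.Module):
    """Projects perception-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, perception_feats: torch.Tensor) -> torch.Tensor:
        # perception_feats: (batch, num_patches, vision_dim) patch features
        # computed from a segmentation or depth map by a frozen vision encoder.
        return self.proj(perception_feats)


# Hypothetical usage with assumed shapes: image_tokens come from the MLLM's
# standard image encoder + projector; seg_feats come from the same encoder
# applied to the segmentation map.
adapter = PerceptionAdapter()
image_tokens = torch.randn(1, 576, 4096)
seg_feats = torch.randn(1, 576, 1024)
control_tokens = adapter(seg_feats)

# Concatenate perception tokens with image tokens; the combined sequence would
# then be prepended to the text embeddings fed to the LLM.
multimodal_tokens = torch.cat([control_tokens, image_tokens], dim=1)
print(multimodal_tokens.shape)  # torch.Size([1, 1152, 4096])
```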