LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Abstract

With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their grounding and chat capabilities are usually separate, and their chat performance drops dramatically when they are asked to ground. The underlying problem is the lack of a dataset for grounded visual chat (GVC): existing grounding datasets contain only short captions. To address this issue, we have created GVC data that allows grounding and chat capabilities to be combined. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.

