BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Abstract

Large language models (LLMs) have demonstrated remarkable abilities at interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend these abilities by incorporating multi-modal inputs, including image, video, and speech. Although these models are effective at producing a precise and detailed language understanding of a given modality signal, they give up the ability to ground specific parts of the input and therefore construct only a coarse-grained mapping. However, an explicit and informative correspondence between text and other modalities would not only improve the user experience but also help expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio, and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when it is generating a response or description for that object. Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image; 2) a two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model, and dataset are available at https://bubo-gpt.github.io.
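
The grounding step named in the first contribution (extract entities from a sentence, then find the matching region in the image) can be illustrated with a toy sketch. Everything in it is hypothetical: the `Region` type, the `extract_entities` and `ground_entities` helpers, and the simple word-matching rule are assumptions made for illustration and are not BuboGPT's implementation, which relies on SAM to produce masks. Labeled bounding boxes stand in for real segmentation masks here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a labeled segmentation mask."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def extract_entities(sentence, vocabulary):
    """Return vocabulary words found in the sentence, in order of first appearance.

    A toy substitute for a real entity extractor (assumption, not the paper's method).
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    seen = []
    for token in tokens:
        if token in vocabulary and token not in seen:
            seen.append(token)
    return seen


def ground_entities(sentence, regions):
    """Map each entity mentioned in the sentence to the boxes carrying that label."""
    vocabulary = {region.label for region in regions}
    return {
        entity: [r.box for r in regions if r.label == entity]
        for entity in extract_entities(sentence, vocabulary)
    }


# Illustrative data only: three labeled regions and one generated sentence.
regions = [
    Region("dog", (10, 20, 110, 140)),
    Region("ball", (150, 100, 180, 130)),
    Region("tree", (200, 0, 300, 200)),
]
grounded = ground_entities("The dog is chasing a ball.", regions)
```

In this sketch only the objects actually mentioned in the sentence are grounded, so a region that the text never refers to (the tree) is left out of the result.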
