Tiny LVLM-eHub: Early Multimodal Experiments with Bard

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have brought significant progress on complex multimodal tasks. Among these developments, Google's Bard stands out for its remarkable multimodal capabilities, which support comprehensive understanding and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub named Tiny LVLM-eHub. Compared with the vanilla version, Tiny LVLM-eHub has several appealing properties. First, it provides a systematic assessment of six categories of multimodal capability (visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence) through quantitative evaluation on 42 standard text-related visual benchmarks. Second, it conducts an in-depth analysis of LVLMs' predictions using ChatGPT Ensemble Evaluation (CEE), which yields a robust and accurate evaluation and aligns better with human evaluation than the word matching approach. Third, it comprises only 2.1K image-text pairs, making it easy for practitioners to evaluate their own offline LVLMs. Through extensive experimental analysis, this study demonstrates that Bard outperforms previous LVLMs in most multimodal capabilities except object hallucination, to which Bard is still susceptible. Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages innovative strategies aimed at advancing multimodal techniques. Our project is publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.
