Tiny LVLM-eHub: Early Multimodal Experiments with Bard
August 7, 2023
作者: Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo
cs.AI
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress in tackling complex multimodal tasks. Among these cutting-edge developments, Google's Bard stands out for its remarkable multimodal capabilities, supporting comprehensive understanding and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. Compared to the vanilla version, Tiny LVLM-eHub possesses several appealing properties. First, it provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence, through quantitative evaluation of 42 standard text-related visual benchmarks. Second, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which yields a robust and accurate evaluation and exhibits improved alignment with human judgment compared to the word-matching approach. Third, it comprises a mere 2.1K image-text pairs, making it easy for practitioners to evaluate their own offline LVLMs. Through extensive experimental analysis, this study demonstrates that Bard outperforms previous LVLMs in most multimodal capabilities except object hallucination, to which Bard remains susceptible. Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages innovative strategies aimed at advancing multimodal techniques. Our project is publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.
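To make the contrast between the word-matching baseline and the CEE idea concrete, below is a minimal sketch in Python. The prompt variants, the majority-vote aggregation, and the `ask_judge` callable (standing in for an actual ChatGPT API call) are illustrative assumptions for exposition, not the paper's exact implementation; see the project repository for the real evaluation code.

```python
# Sketch: word-matching baseline vs. a CEE-style ensemble judge.
# Assumptions (not from the paper): the judge prompt wording, the number of
# prompt variants, and majority-vote aggregation. `ask_judge` is a stand-in
# for a real ChatGPT API call.
from collections import Counter
from typing import Callable


def word_matching(prediction: str, answer: str) -> bool:
    """Baseline: count a prediction correct only if the ground-truth answer
    appears verbatim (case-insensitively) in the model's output."""
    return answer.lower() in prediction.lower()


def cee_judge(question: str, answer: str, prediction: str,
              ask_judge: Callable[[str], str], n_prompts: int = 5) -> bool:
    """CEE-style evaluation (illustrative): query a ChatGPT judge with several
    diversified prompts and aggregate the yes/no verdicts by majority vote."""
    votes = []
    for i in range(n_prompts):
        prompt = (
            f"[Prompt variant {i}] Question: {question}\n"
            f"Ground-truth answer: {answer}\n"
            f"Model prediction: {prediction}\n"
            "Does the prediction agree with the ground truth? Reply Yes or No."
        )
        votes.append(ask_judge(prompt).strip().lower().startswith("yes"))
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy case where word matching fails but a semantic judge should not:
    pred = "The animal in the picture is a kitten."
    gold = "cat"
    print("word matching:", word_matching(pred, gold))  # False: no verbatim match
    stub_judge = lambda prompt: "Yes"  # replace with a real ChatGPT call
    print("CEE (stub judge):",
          cee_judge("What animal is shown?", gold, pred, stub_judge))  # True
```

The toy case illustrates why the paper reports better alignment with human judgment: a semantically correct but differently worded answer ("kitten" vs. "cat") is rejected by exact word matching, while an LLM judge can credit it, and ensembling over several prompt variants reduces the variance of any single judge response.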