Tiny LVLM-eHub: Early Multimodal Experiments with Bard
August 7, 2023
作者: Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo
cs.AI
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated
significant progress in tackling complex multimodal tasks. Among these
cutting-edge developments, Google's Bard stands out for its remarkable
multimodal capabilities, promoting comprehensive comprehension and reasoning
across various domains. This work presents an early and holistic evaluation of
LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a
lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the
vanilla version, Tiny LVLM-eHub possesses several appealing properties.
Firstly, it provides a systematic assessment of six categories of multimodal
capabilities, including visual perception, visual knowledge acquisition, visual
reasoning, visual commonsense, object hallucination, and embodied intelligence,
through quantitative evaluation of 42 standard text-related visual
benchmarks. Secondly, it conducts an in-depth analysis of LVLMs' predictions
using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and
accurate evaluation and exhibits improved alignment with human evaluation
compared to the word matching approach. Thirdly, it comprises a mere 2.1K
image-text pairs, facilitating ease of use for practitioners to evaluate their
own offline LVLMs. Through extensive experimental analysis, this study
demonstrates that Bard outperforms previous LVLMs in most multimodal
capabilities except object hallucination, to which Bard is still susceptible.
Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages
innovative strategies aimed at advancing multimodal techniques. Our project is
publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.
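To make the evaluation comparison concrete, below is a minimal Python sketch, purely illustrative and not the paper's implementation, contrasting the word-matching baseline with a single LLM-judge prompt of the kind that CEE ensembles over. The prompt wording, the aggregation by majority vote, and the function names are assumptions; see the Tiny LVLM-eHub repository linked above for the actual protocol.

```python
# Minimal sketch (not the paper's implementation) contrasting the word-matching
# baseline with an LLM-judge style check of an LVLM's free-form answer.
# The prompt wording and any ensembling over multiple judge prompts are
# assumptions; the actual CEE protocol is defined in the Tiny LVLM-eHub repo.

def word_matching_score(prediction: str, ground_truth: str) -> int:
    """Return 1 if the ground-truth answer appears verbatim in the prediction."""
    return int(ground_truth.strip().lower() in prediction.strip().lower())


def build_judge_prompt(question: str, ground_truth: str, prediction: str) -> str:
    """Compose a prompt asking an LLM judge whether the prediction is correct.

    In an ensemble evaluation, several such prompts would be issued and their
    verdicts aggregated (e.g., by majority vote); this shows a single
    hypothetical prompt only.
    """
    return (
        "You are grading a vision-language model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'yes' if the model answer matches the reference, "
        "otherwise 'no'."
    )


if __name__ == "__main__":
    pred = "The image shows two cats sleeping on a red sofa."
    gt = "two cats"
    print(word_matching_score(pred, gt))  # 1: the reference appears verbatim
    print(build_judge_prompt("How many cats are in the image?", gt, pred))
```

The design point the abstract makes is that naive substring matching penalizes paraphrased but correct answers (e.g., "a pair of cats"), whereas an ensemble of LLM-judge verdicts tracks human judgments more closely.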