Tiny LVLM-eHub: Early Multimodal Experiments with Bard
August 7, 2023
作者: Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo
cs.AI
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated
significant progress in tackling complex multimodal tasks. Among these
cutting-edge developments, Google's Bard stands out for its remarkable
multimodal capabilities, promoting comprehensive comprehension and reasoning
across various domains. This work presents an early and holistic evaluation of
LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a
lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the
vanilla version, Tiny LVLM-eHub possesses several appealing properties.
Firstly, it provides a systematic assessment of six categories of multimodal
capabilities, including visual perception, visual knowledge acquisition, visual
reasoning, visual commonsense, object hallucination, and embodied intelligence,
through quantitative evaluation of 42 standard text-related visual
benchmarks. Secondly, it conducts an in-depth analysis of LVLMs' predictions
using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and
accurate evaluation and exhibits improved alignment with human evaluation
compared to the word matching approach. Thirdly, it comprises a mere 2.1K
image-text pairs, facilitating ease of use for practitioners to evaluate their
own offline LVLMs. Through extensive experimental analysis, this study
demonstrates that Bard outperforms previous LVLMs in most multimodal
capabilities except object hallucination, to which Bard is still susceptible.
Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages
innovative strategies aimed at advancing multimodal techniques. Our project is
publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.
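To make the evaluation comparison concrete, below is a minimal Python sketch, purely illustrative and not the paper's implementation, contrasting the word-matching baseline with a single LLM-judge prompt of the kind that CEE ensembles over. The prompt wording, the aggregation by majority vote, and the function names are assumptions; see the Tiny LVLM-eHub repository linked above for the actual protocol.

```python
# Minimal sketch (not the paper's implementation) contrasting the word-matching
# baseline with an LLM-judge style check of an LVLM's free-form answer.
# The prompt wording and any ensembling over multiple judge prompts are
# assumptions; the actual CEE protocol is defined in the Tiny LVLM-eHub repo.

def word_matching_score(prediction: str, ground_truth: str) -> int:
    """Return 1 if the ground-truth answer appears verbatim in the prediction."""
    return int(ground_truth.strip().lower() in prediction.strip().lower())


def build_judge_prompt(question: str, ground_truth: str, prediction: str) -> str:
    """Compose a prompt asking an LLM judge whether the prediction is correct.

    In an ensemble evaluation, several such prompts would be issued and their
    verdicts aggregated (e.g., by majority vote); this shows a single
    hypothetical prompt only.
    """
    return (
        "You are grading a vision-language model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'yes' if the model answer matches the reference, "
        "otherwise 'no'."
    )


if __name__ == "__main__":
    pred = "The image shows two cats sleeping on a red sofa."
    gt = "two cats"
    print(word_matching_score(pred, gt))  # 1: the reference appears verbatim
    print(build_judge_prompt("How many cats are in the image?", gt, pred))
```

The design point the abstract makes is that naive substring matching penalizes paraphrased but correct answers (e.g., "a pair of cats"), whereas an ensemble of LLM-judge verdicts tracks human judgments more closely.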