
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

April 3, 2024
Authors: Gabriela Ben Melech Stan, Raanan Yehezkel Rohekar, Yaniv Gurwicz, Matthew Lyle Olson, Anahita Bhiwandiwalla, Estelle Aflalo, Chenfei Wu, Nan Duan, Shao-Yen Tseng, Vasudev Lal
cs.AI

Abstract

In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
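The abstract describes inspecting which image patches are instrumental in generating an answer and how well the language model grounds its output in the image. A common way to approximate this kind of analysis is to read out the decoder's attention from generated tokens to the image-patch positions. The following is a minimal sketch of that idea using the HuggingFace transformers API; the checkpoint name, image file, prompt format, and 24x24 patch grid are illustrative assumptions, and this is not the paper's actual LVLM-Interpret implementation, which is an interactive application.

```python
# Minimal sketch (assumptions): llava-hf/llava-1.5-7b-hf checkpoint, a recent
# transformers release whose processor expands the <image> placeholder into one
# input token per image patch (24 x 24 = 576 for LLaVA-1.5), and a local image.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # needed so attention weights are materialized
)

image = Image.open("example.jpg")  # hypothetical input image
prompt = "USER: <image>\nWhat is on the table? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate while returning attention weights for every decoding step.
out = model.generate(
    **inputs,
    max_new_tokens=20,
    output_attentions=True,
    return_dict_in_generate=True,
)

# Positions of the image-patch tokens in the input sequence
# (image_token_index is the placeholder id in the Llava config).
image_positions = (
    inputs.input_ids[0] == model.config.image_token_index
).nonzero(as_tuple=True)[0]

# Attention from the first generated token to all keys, averaged over
# layers and heads of that decoding step, then restricted to image patches.
step_attn = torch.stack([layer[0, :, -1, :] for layer in out.attentions[0]])  # (layers, heads, key_len)
patch_attn = step_attn.mean(dim=(0, 1))[image_positions]                      # (576,)
heatmap = patch_attn.reshape(24, 24).float().cpu()                            # patch-grid relevancy map
print(heatmap)
```

Averaging raw attention over layers and heads is only one of several possible relevancy estimates; per-head or relevancy-propagation views can give sharper maps, which is part of what an interactive tool like the one described here lets a user explore.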

