

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

September 14, 2024
Authors: Neelabh Sinha, Vinija Jain, Aman Chadha
cs.AI

Abstract

Visual Question-Answering (VQA) has become a key use case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) began achieving good results in zero-shot inference. However, evaluating different VLMs for an application requirement using a standardized framework in practical settings remains challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.


November 16, 2024