An Early Evaluation of GPT-4V(ision)
October 25, 2023
Authors: Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin
cs.AI
Abstract
In this paper, we evaluate several abilities of GPT-4V, including visual
understanding, language understanding, visual puzzle solving, and understanding
of other modalities such as depth, thermal, video, and audio. To assess
GPT-4V's performance, we manually construct 656 test instances and carefully
evaluate GPT-4V's results. The highlights of our findings are as follows:
(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks
but fails to recognize simple Chinese text in images; (2) GPT-4V shows
inconsistent refusal behavior when answering questions related to sensitive
traits such as gender, race, and age; (3) GPT-4V obtains worse results than
GPT-4 (API) on language understanding tasks including general language
understanding benchmarks and visual commonsense knowledge evaluation
benchmarks; (4) few-shot prompting can improve GPT-4V's performance on both
visual understanding and language understanding; (5) GPT-4V struggles to find
the subtle differences between two similar images and to solve easy math picture
puzzles; (6) GPT-4V shows non-trivial performance on tasks in modalities similar
to images, such as video and thermal. Our experimental results reveal the
abilities and limitations of GPT-4V, and we hope our paper can provide some
insights into the application and research of GPT-4V.