An Early Evaluation of GPT-4V(ision)

October 25, 2023
Authors: Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin
cs.AI

Abstract

In this paper, we evaluate different abilities of GPT-4V, including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. To assess GPT-4V's performance, we manually construct 656 test instances and carefully evaluate GPT-4V's results. The highlights of our findings are as follows: (1) GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese text in images; (2) GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age; (3) GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks, including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks; (4) few-shot prompting can improve GPT-4V's performance on both visual understanding and language understanding; (5) GPT-4V struggles to find subtle differences between two similar images and to solve easy math picture puzzles; (6) GPT-4V shows non-trivial performance on tasks involving modalities similar to images, such as video and thermal. Our experimental results reveal the abilities and limitations of GPT-4V, and we hope our paper can provide some insights into the application and research of GPT-4V.