An Early Evaluation of GPT-4V(ision)

Abstract

In this paper, we evaluate several abilities of GPT-4V: visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate GPT-4V's results. The highlights of our findings are as follows: (1) GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese text in images; (2) GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age; (3) GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks, including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks; (4) few-shot prompting can improve GPT-4V's performance on both visual understanding and language understanding; (5) GPT-4V struggles to find the subtle differences between two similar images and to solve simple math picture puzzles; (6) GPT-4V shows non-trivial performance on tasks in modalities similar to images, such as video and thermal. Our experimental results reveal the capabilities and limitations of GPT-4V, and we hope our paper can provide some insights into the application and research of GPT-4V.
