An Early Evaluation of GPT-4V(ision)

Abstract

In this paper, we evaluate different abilities of GPT-4V, including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the model's outputs. The highlights of our findings are as follows: (1) GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese text in images; (2) GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age; (3) GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks, including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks; (4) few-shot prompting can improve GPT-4V's performance on both visual understanding and language understanding; (5) GPT-4V struggles to find the subtle differences between two similar images and to solve easy math picture puzzles; (6) GPT-4V shows non-trivial performance on tasks in modalities similar to images, such as video and thermal. Our experimental results reveal the abilities and limitations of GPT-4V, and we hope our paper can provide some insights into the application and research of GPT-4V.
