VHELM: A Holistic Evaluation of Vision Language Models

Abstract

Current benchmarks for assessing vision-language models (VLMs) often focus on perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. They also differ in their evaluation procedures and scope, which makes it difficult to compare models. To address these issues, we extend the HELM framework to VLMs and present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets, each covering one or more of 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of VLM capabilities across these important factors. In addition, we standardize the inference parameters, prompting methods, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full-size counterparts (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
