VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
July 16, 2024
Authors: Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
cs.AI
Abstract
We present VLMEvalKit: an open-source toolkit for evaluating large
multi-modality models based on PyTorch. The toolkit aims to provide a
user-friendly and comprehensive framework for researchers and developers to
evaluate existing multi-modality models and publish reproducible evaluation
results. In VLMEvalKit, we implement over 70 different large multi-modality
models, including both proprietary APIs and open-source models, as well as more
than 20 different multi-modal benchmarks. By implementing a single interface,
new models can be easily added to the toolkit, while the toolkit automatically
handles the remaining workloads, including data preparation, distributed
inference, prediction post-processing, and metric calculation. Although the
toolkit is currently mainly used for evaluating large vision-language models,
its design is compatible with future updates that incorporate additional
modalities, such as audio and video. Based on the evaluation results obtained
with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to
track the progress of multi-modality learning research. The toolkit is released
at https://github.com/open-compass/VLMEvalKit and is actively maintained.
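The "single interface" mentioned in the abstract means a new model only has to expose one generation entry point, while the toolkit itself takes care of data preparation, distributed inference, prediction post-processing, and metric calculation. The sketch below illustrates the idea; it is a minimal, hypothetical example, and the class name `CustomVLM`, the `generate` signature, and the message format are assumptions for illustration, not necessarily VLMEvalKit's exact API.

```python
# Illustrative sketch only: the class name, method signature, and message
# format below are assumptions for demonstration; consult the VLMEvalKit
# repository for the actual interface a new model must implement.
from typing import Dict, List


class CustomVLM:
    """A hypothetical model wrapper exposing a single generation entry point.

    Per the abstract, once such an interface is implemented, the toolkit
    handles the remaining workloads (data preparation, distributed inference,
    post-processing, and metric calculation).
    """

    def __init__(self, model_path: str):
        # Load your checkpoint / processor here in a real implementation.
        self.model_path = model_path

    def generate(self, message: List[Dict[str, str]]) -> str:
        # `message` is assumed to be an interleaved list of items such as
        # {"type": "image", "value": "/path/to/img.jpg"} and
        # {"type": "text", "value": "What is shown in the image?"}.
        images = [m["value"] for m in message if m["type"] == "image"]
        prompt = "\n".join(m["value"] for m in message if m["type"] == "text")
        # Replace this stub with a real forward pass of your model.
        return f"[stub answer for {len(images)} image(s): {prompt[:40]}]"


if __name__ == "__main__":
    model = CustomVLM("my/checkpoint")
    print(model.generate([
        {"type": "image", "value": "demo.jpg"},
        {"type": "text", "value": "Describe the image."},
    ]))
```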