VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Abstract

We present VLMEvalKit, an open-source, PyTorch-based toolkit for evaluating large multi-modality models. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. VLMEvalKit implements over 70 large multi-modality models, including both proprietary APIs and open-source models, and more than 20 multi-modal benchmarks. Adding a new model requires implementing only a single interface; the toolkit automatically handles the remaining workload, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently used mainly for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
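
The single-interface design described in the abstract can be illustrated with a minimal, self-contained sketch. It does not use VLMEvalKit's actual API: the names `MultiModalModel`, `AlwaysAModel`, `extract_choice`, and `evaluate`, as well as the sample format, are hypothetical and chosen only for illustration. The sketch covers inference, prediction post-processing, and metric calculation on a toy multiple-choice task; data preparation and distributed inference are omitted.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MultiModalModel(ABC):
    """Hypothetical single interface: a model only maps one sample to text."""

    @abstractmethod
    def generate(self, image_path: str, question: str) -> str:
        ...


class AlwaysAModel(MultiModalModel):
    """Toy stand-in for a real model; it ignores its inputs and answers 'A'."""

    def generate(self, image_path: str, question: str) -> str:
        return "The answer is A."


def extract_choice(prediction: str) -> str:
    """Naive post-processing: return the first standalone option letter found."""
    for token in prediction.replace(".", " ").split():
        if token in {"A", "B", "C", "D"}:
            return token
    return ""


def evaluate(model: MultiModalModel, samples: List[Dict[str, str]]) -> float:
    """Toolkit side: run inference, post-process predictions, compute accuracy."""
    correct = 0
    for sample in samples:
        raw = model.generate(sample["image"], sample["question"])
        if extract_choice(raw) == sample["answer"]:
            correct += 1
    return correct / len(samples) if samples else 0.0


# Hypothetical two-sample benchmark; the image paths are never opened here.
SAMPLES = [
    {"image": "img_0.jpg", "question": "Which option is correct?", "answer": "A"},
    {"image": "img_1.jpg", "question": "Which option is correct?", "answer": "B"},
]

accuracy = evaluate(AlwaysAModel(), SAMPLES)
print(f"accuracy = {accuracy:.2f}")
```

In this sketch, supporting another model means writing one more subclass with a `generate` method; the evaluation loop stays unchanged, which is the division of labor the abstract describes.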
