WorldMedQA-V: A Multilingual, Multimodal Medical Examination Dataset for Multimodal Language Model Evaluation

Abstract

Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available for only a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), each provided in its original language and in an English translation validated by native clinicians. Baseline performance for common open- and closed-source models is reported for both the local-language and English versions, with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.