ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?

Abstract

Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. We present the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams by introducing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art (SOTA) VLMs achieve a mean accuracy of only 57.74%, and open-source models only 27.70%, across 7 academic domains: Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%); only the thinking VLM o3 (74.07%) exceeds average human performance, yet it still falls substantially short of the best human performance (99.60%). Cross-lingual prompting with English instructions while keeping the Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance, by 5 percentage points. Code and data are available at: https://vi-exam.github.io.
