ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
August 19, 2025
Authors: Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim
cs.AI
Abstract
Vision language models (VLMs) demonstrate remarkable capabilities on English
multimodal tasks, but their performance on low-resource languages with
genuinely multimodal educational content remains largely unexplored. In this
work, we test how VLMs perform on Vietnamese educational assessments,
investigating whether VLMs trained predominantly on English data can handle
real-world cross-lingual multimodal reasoning. Our work presents the first
comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams
through proposing ViExam, a benchmark containing 2,548 multimodal questions. We
find that state-of-the-art VLMs achieve only 57.74% mean accuracy, while open-source
models achieve 27.70%, across 7 academic domains, including Mathematics,
Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs
underperform average human test-takers (66.54%), with only the thinking VLM o3
(74.07%) exceeding human average performance, yet still falling substantially
short of human best performance (99.60%). Cross-lingual prompting with English
instructions while maintaining Vietnamese content fails to improve performance,
decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop
collaboration can partially improve VLM performance by 5 percentage points.
Code and data are available at: https://vi-exam.github.io.