HakushoBench: A Japanese Chart and Table VQA Benchmark Built from Governmental White Papers

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.