HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. English benchmarks have advanced rapidly, but non-English counterparts remain scarce, so it is unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we use governmental white papers as a scalable source for benchmark construction beyond English: they contain naturally occurring charts and tables across diverse formats and domains, and they are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table visual question answering (VQA) benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, each with manually annotated QA pairs designed to assess deep and holistic understanding of charts and tables rather than reliance on local visual cues alone. Experiments across a broad range of VLMs show that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a gap of 34.9 percentage points between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.