HakushoBench: A Japanese Chart and Table VQA Benchmark Built from Governmental White Papers

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.