HakushoBench: A Japanese Chart and Table VQA Benchmark Based on Governmental White Papers

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. English benchmarks have advanced rapidly, but non-English counterparts remain scarce, so it is unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we use governmental white papers as a scalable source for building benchmarks beyond English: they contain naturally occurring charts and tables across diverse formats and domains, and they are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning more than 10 image types, with manually annotated QA pairs designed to assess deep, holistic understanding of charts and tables rather than local visual cues alone. Experiments across a broad range of VLMs show that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-percentage-point gap between open-weight and proprietary models indicates substantial room for improvement in complex chart and table understanding. We release our dataset and code.
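
The headline numbers above are an accuracy and a gap expressed in percentage points. As a minimal sketch of how such figures could be computed, the snippet below scores predictions by exact string match and takes the absolute difference between two accuracies. The exact-match rule, the function names `accuracy` and `point_gap`, and the toy answers are all assumptions made for illustration; the abstract does not specify the benchmark's scoring procedure, and none of the data comes from HakushoBench.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent (a hypothetical scoring rule)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count answers that match the reference after trimming whitespace.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def point_gap(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return abs(acc_a - acc_b)


# Toy data invented for illustration only; not taken from HakushoBench.
references = ["12.5", "東京都", "2019", "増加"]
open_weight_preds = ["12.5", "大阪府", "2019", "減少"]
proprietary_preds = ["12.5", "東京都", "2019", "減少"]

open_acc = accuracy(open_weight_preds, references)
prop_acc = accuracy(proprietary_preds, references)
gap = point_gap(prop_acc, open_acc)
print(f"open-weight: {open_acc:.1f}%  proprietary: {prop_acc:.1f}%  gap: {gap:.1f} points")
```

A gap in percentage points is an absolute difference between two accuracies, not a relative change: in this toy example the gap is the simple subtraction of the two scores, not their ratio.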