WildTableBench: Benchmarking Multimodal Foundation Models for Real-World Table Understanding

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or cleanly rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images span varied layouts and diverse domains, and they demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while the remaining models score between 4.1% and 49.9%. We further conduct diagnostic analyses to characterize model failures, revealing persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.