WildTableBench: Benchmarking Multimodal Foundation Models for Table Understanding in the Wild

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in both consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and span diverse domains, and they demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while the remaining 20 models score between 4.1% and 49.9%. We further conduct diagnostic analyses to characterize model failures, revealing persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.