WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding in Real-World Scenarios

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.