Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Abstract

Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially for rendered table images. To address this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs that collaborate across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
