Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Abstract

Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
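
The LLM-jury filtering step named in the abstract can be pictured as a vote over candidate QA pairs. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the `QAPair` type, the `jury_filter` function, the toy judge functions, and the strict-majority rule are all hypothetical, since the abstract does not say how jury verdicts are aggregated. In a real pipeline each judge would presumably wrap a call to a different LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QAPair:
    """Hypothetical container for one generated question/answer pair."""
    question: str
    answer: str


# A judge is any callable that returns True when it accepts a QA pair.
# Assumption: in practice each judge would query a separate LLM.
Judge = Callable[[QAPair], bool]


def jury_filter(pairs: List[QAPair], judges: List[Judge]) -> List[QAPair]:
    """Keep the pairs accepted by a strict majority of judges (assumed rule)."""
    if not judges:
        raise ValueError("at least one judge is required")
    kept = []
    for pair in pairs:
        votes = sum(1 for judge in judges if judge(pair))
        if votes * 2 > len(judges):  # strict majority
            kept.append(pair)
    return kept


# Toy stand-in judges, used only to make the sketch runnable.
def has_answer(pair: QAPair) -> bool:
    return bool(pair.answer.strip())


def is_question(pair: QAPair) -> bool:
    return pair.question.strip().endswith("?")


def is_multi_step(pair: QAPair) -> bool:
    # Crude proxy for reasoning depth: the question has at least 8 words.
    return len(pair.question.split()) >= 8


candidates = [
    QAPair("Which region has the highest total revenue across all four quarters?", "EMEA"),
    QAPair("Total?", ""),
    QAPair("List the rows", "Row 1"),
]
accepted = jury_filter(candidates, [has_answer, is_question, is_multi_step])
```

Only the first candidate is accepted by a majority of the three toy judges. Other aggregation rules, such as requiring unanimity or averaging scores, would fit the same interface by changing the vote check.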
