ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Abstract

Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been growing interest in developing pre-trained foundation models as well as general-purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer from crucial drawbacks along two critical axes that affect the performance of chart representation models: they are trained on data generated from the underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and they use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at https://github.com/vis-nlp/ChartGemma.
