ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
July 4, 2024
Authors: Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, Shafiq Joty
cs.AI
Abstract
Given the ubiquity of charts as a data analysis, visualization, and
decision-making tool across industries and sciences, there has been a growing
interest in developing pre-trained foundation models as well as general purpose
instruction-tuned models for chart understanding and reasoning. However,
existing methods suffer crucial drawbacks across two critical axes affecting
the performance of chart representation models: they are trained on data
generated from underlying data tables of the charts, ignoring the visual trends
and patterns in chart images, and use weakly aligned vision-language backbone
models for domain-specific training, limiting their generalizability when
encountering charts in the wild. We address these important drawbacks and
introduce ChartGemma, a novel chart understanding and reasoning model developed
over PaliGemma. Rather than relying on underlying data tables, ChartGemma is
trained on instruction-tuning data generated directly from chart images, thus
capturing both high-level trends and low-level visual information from a
diverse set of charts. Our simple approach achieves state-of-the-art results
across 5 benchmarks spanning chart summarization, question answering, and
fact-checking, and our elaborate qualitative studies on real-world charts show
that ChartGemma generates more realistic and factually correct summaries
compared to its contemporaries. We release the code, model checkpoints,
dataset, and demos at https://github.com/vis-nlp/ChartGemma.