ChartCap: Mitigating Hallucination of Dense Chart Captioning
August 5, 2025
Authors: Junyoung Lim, Jaewoo Ahn, Gunhee Kim
cs.AI
Abstract
Generating accurate, informative, and hallucination-free captions for charts
remains challenging for vision language models, primarily due to the lack of
large-scale, high-quality datasets of real-world charts. However, existing
real-world chart datasets suffer from the inclusion of extraneous information
that cannot be inferred from the chart and failure to sufficiently capture
structural elements and key insights. Therefore, we introduce ChartCap, a
large-scale dataset of 565K real-world chart images paired with type-specific,
dense captions that exclude extraneous information and highlight both
structural elements and key insights in detail. To build ChartCap, we design a
four-stage pipeline that generates captions using only the discernible data
from the chart and employs cycle consistency-based human verification, which
accelerates quality control without sacrificing accuracy. Additionally, we
propose a novel metric, the Visual Consistency Score, which evaluates caption
quality by measuring the similarity between the chart regenerated from a
caption and the original chart, independent of reference captions. Extensive
experiments confirm that models fine-tuned on ChartCap consistently generate
more accurate and informative captions with reduced hallucinations, surpassing
both open-source and proprietary models and even human-annotated captions.
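The Visual Consistency Score described above compares a chart regenerated from a caption against the original chart without needing a reference caption. A minimal sketch of such a comparison is shown below; it assumes the two charts have already been rendered and encoded into feature vectors (the encoder, the regeneration step, and the function name are illustrative assumptions, not the paper's exact formulation).

```python
import numpy as np

def visual_consistency_score(original_emb, regenerated_emb):
    """Hypothetical sketch: cosine similarity between the embedding of the
    original chart image and that of the chart regenerated from its caption.
    The paper's actual metric may use a different encoder or aggregation."""
    a = np.asarray(original_emb, dtype=float)
    b = np.asarray(regenerated_emb, dtype=float)
    # Cosine similarity in [-1, 1]; higher means the regenerated chart
    # is visually closer to the original, i.e., a more faithful caption.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A caption that preserves the chart's structure and data should yield a regenerated chart whose embedding lies close to the original's, pushing the score toward 1.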