ChartCap: Mitigating Hallucination of Dense Chart Captioning
August 5, 2025
Authors: Junyoung Lim, Jaewoo Ahn, Gunhee Kim
cs.AI
Abstract
Generating accurate, informative, and hallucination-free captions for charts
remains challenging for vision language models, primarily due to the lack of
large-scale, high-quality datasets of real-world charts. However, existing
real-world chart datasets suffer from the inclusion of extraneous information
that cannot be inferred from the chart and failure to sufficiently capture
structural elements and key insights. Therefore, we introduce ChartCap, a
large-scale dataset of 565K real-world chart images paired with type-specific,
dense captions that exclude extraneous information and highlight both
structural elements and key insights in detail. To build ChartCap, we design a
four-stage pipeline that generates captions using only the discernible data
from the chart and employs cycle consistency-based human verification, which
accelerates quality control without sacrificing accuracy. Additionally, we
propose a novel metric, the Visual Consistency Score, which evaluates caption
quality by measuring the similarity between the chart regenerated from a
caption and the original chart, independent of reference captions. Extensive
experiments confirm that models fine-tuned on ChartCap consistently generate
more accurate and informative captions with reduced hallucinations, surpassing
both open-source and proprietary models and even human-annotated captions.
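The Visual Consistency Score described above compares a chart regenerated from a caption against the original chart without needing a reference caption. A minimal sketch of such a comparison is shown below; it assumes the two charts have already been rendered and encoded into feature vectors (the encoder, the regeneration step, and the function name are illustrative assumptions, not the paper's exact formulation).

```python
import numpy as np

def visual_consistency_score(original_emb, regenerated_emb):
    """Hypothetical sketch: cosine similarity between the embedding of the
    original chart image and that of the chart regenerated from its caption.
    The paper's actual metric may use a different encoder or aggregation."""
    a = np.asarray(original_emb, dtype=float)
    b = np.asarray(regenerated_emb, dtype=float)
    # Cosine similarity in [-1, 1]; higher means the regenerated chart
    # is visually closer to the original, i.e., a more faithful caption.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A caption that preserves the chart's structure and data should yield a regenerated chart whose embedding lies close to the original's, pushing the score toward 1.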