ChartCap: Mitigating Hallucination of Dense Chart Captioning
August 5, 2025
Authors: Junyoung Lim, Jaewoo Ahn, Gunhee Kim
cs.AI
Abstract
Generating accurate, informative, and hallucination-free captions for charts
remains challenging for vision language models, primarily due to the lack of
large-scale, high-quality datasets of real-world charts. However, existing
real-world chart datasets suffer from the inclusion of extraneous information
that cannot be inferred from the chart and failure to sufficiently capture
structural elements and key insights. Therefore, we introduce ChartCap, a
large-scale dataset of 565K real-world chart images paired with type-specific,
dense captions that exclude extraneous information and highlight both
structural elements and key insights in detail. To build ChartCap, we design a
four-stage pipeline that generates captions using only the discernible data
from the chart and employ cycle-consistency-based human verification, which
accelerates quality control without sacrificing accuracy. Additionally, we
propose a novel metric, the Visual Consistency Score, which evaluates caption
quality by measuring the similarity between the chart regenerated from a
caption and the original chart, independent of reference captions. Extensive
experiments confirm that models fine-tuned on ChartCap consistently generate
more accurate and informative captions with reduced hallucinations, surpassing
both open-source and proprietary models and even human-annotated captions.
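The Visual Consistency Score compares a chart regenerated from a caption against the original chart, without needing a reference caption. The abstract does not specify the regeneration model or the similarity function, so the following is only a minimal stand-in sketch: it assumes both charts are available as grayscale pixel arrays and uses cosine similarity of downsampled intensities as a hypothetical similarity measure (the paper's actual metric may differ substantially).

```python
# Toy sketch of a VCS-style image comparison (NOT the paper's metric).
# Assumptions: the original chart and the chart regenerated from a
# caption are given as 2-D grayscale numpy arrays; similarity is a
# stand-in cosine similarity over downsampled pixel intensities.
import numpy as np


def toy_visual_consistency(original: np.ndarray,
                           regenerated: np.ndarray,
                           size: int = 32) -> float:
    """Return a similarity score in [0, 1] between two grayscale images."""
    def downsample(img: np.ndarray) -> np.ndarray:
        # Sample a fixed size x size grid of pixels, then flatten.
        h, w = img.shape
        ys = np.linspace(0, h - 1, size).astype(int)
        xs = np.linspace(0, w - 1, size).astype(int)
        return img[np.ix_(ys, xs)].astype(float).ravel()

    a, b = downsample(original), downsample(regenerated)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    # Pixel intensities are non-negative, so cosine similarity lies in [0, 1].
    return float(np.dot(a, b) / denom)
```

A real implementation would regenerate the chart by rendering the caption's extracted data (e.g., via a charting library or a chart-generation model) and would likely use a learned perceptual or structural similarity rather than raw pixels.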