ChartCap: Mitigating Hallucination of Dense Chart Captioning

Abstract

Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. Existing real-world chart datasets have two shortcomings: they include extraneous information that cannot be inferred from the chart, and they fail to sufficiently capture structural elements and key insights. We therefore introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the data discernible from the chart, and we employ cycle-consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the original chart and a chart regenerated from the caption, independent of reference captions. Extensive experiments confirm that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing both open-source and proprietary models and even human-annotated captions.
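The idea behind the Visual Consistency Score can be illustrated with a minimal sketch. The abstract does not say how the chart is regenerated or how similarity is computed, so everything below is an assumption: `regenerate_chart` and `embed_chart` are hypothetical callables supplied by the caller, and cosine similarity over feature vectors is only one plausible choice of similarity measure.

```python
from math import sqrt
from typing import Callable, Sequence, TypeVar

Chart = TypeVar("Chart")  # whatever object represents a rendered chart image


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def visual_consistency_score(
    caption: str,
    original_chart: Chart,
    regenerate_chart: Callable[[str], Chart],          # hypothetical: caption -> chart
    embed_chart: Callable[[Chart], Sequence[float]],   # hypothetical: chart -> features
) -> float:
    """Score a caption by comparing the chart it yields with the original.

    No reference caption is involved: only the original chart and the
    chart regenerated from the candidate caption are compared.
    """
    regenerated = regenerate_chart(caption)
    return cosine_similarity(embed_chart(original_chart), embed_chart(regenerated))
```

Under these assumptions, a caption that preserves the chart's structure and values should yield a regenerated chart whose features are close to the original's, giving a score near 1, while a caption with hallucinated content should score lower.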
