Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
April 8, 2026
Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou
cs.AI
Abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
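The abstract names three evaluation measures: exact-match, partial-match, and attribute-level accuracy over structured metadata fields. The sketch below illustrates how such scores could be aggregated; it is a minimal illustration, not the paper's pipeline. In particular, the paper uses an LLM-as-Judge to measure semantic alignment, whereas this sketch substitutes a simple normalized string comparison as a stand-in judge, and the attribute set (`creator`, `origin`, `period`) is taken from the abstract's examples.

```python
# Hypothetical sketch of the matching metrics named in the abstract.
# Assumption: a normalized string comparison stands in for the paper's
# LLM-as-Judge semantic-alignment check.

ATTRIBUTES = ("creator", "origin", "period")  # example metadata fields


def normalize(value: str) -> str:
    """Lowercase and trim a field value before comparison."""
    return value.strip().lower()


def score(pred: dict, ref: dict) -> dict:
    """Score one prediction against its reference annotation."""
    hits = {
        a: normalize(pred.get(a, "")) == normalize(ref.get(a, ""))
        for a in ATTRIBUTES
    }
    n_correct = sum(hits.values())
    return {
        "exact_match": n_correct == len(ATTRIBUTES),    # all fields agree
        "partial_match": 0 < n_correct < len(ATTRIBUTES),  # some, not all
        "attribute_hits": hits,                          # per-field outcome
    }


def aggregate(items):
    """Average per-item scores over (prediction, reference) pairs."""
    results = [score(p, r) for p, r in items]
    n = len(results)
    return {
        "exact_match_rate": sum(r["exact_match"] for r in results) / n,
        "partial_match_rate": sum(r["partial_match"] for r in results) / n,
        "attribute_accuracy": {
            a: sum(r["attribute_hits"][a] for r in results) / n
            for a in ATTRIBUTES
        },
    }
```

Grouping the same per-item scores by a region label would yield the per-culture breakdown the abstract reports; that step is omitted here for brevity.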