

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

April 8, 2026
Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou
cs.AI

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
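The abstract's evaluation protocol (exact-match, partial-match, and attribute-level accuracy aggregated per cultural region) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the attribute names and record schema are assumptions, and plain string equality stands in for the LLM-as-Judge semantic matching the paper describes.

```python
from collections import defaultdict

# Illustrative metadata attributes; the benchmark's real schema may differ.
ATTRIBUTES = ("creator", "origin", "period")

def score(predictions, references):
    """Compute exact-match, partial-match, and per-attribute accuracy,
    grouped by each reference record's cultural region.

    Note: equality here is a stand-in for the semantic-alignment judgment
    an LLM-as-Judge would make in the paper's framework.
    """
    stats = defaultdict(lambda: {"exact": 0, "partial": 0, "n": 0,
                                 "attr": {a: 0 for a in ATTRIBUTES}})
    for pred, ref in zip(predictions, references):
        s = stats[ref["region"]]
        s["n"] += 1
        matched = [a for a in ATTRIBUTES if pred.get(a) == ref[a]]
        for a in matched:
            s["attr"][a] += 1
        if len(matched) == len(ATTRIBUTES):
            s["exact"] += 1    # every attribute judged correct
        if matched:
            s["partial"] += 1  # at least one attribute judged correct
    return {
        region: {
            "exact": s["exact"] / s["n"],
            "partial": s["partial"] / s["n"],
            "attribute": {a: c / s["n"] for a, c in s["attr"].items()},
        }
        for region, s in stats.items()
    }
```

Reporting these three granularities separately is what lets the paper distinguish models that recover full records from those that only capture fragmented signals (high partial-match, low exact-match).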