Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
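
The three reported metrics can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions: an object counts as an exact match when the judge accepts every attribute, as a partial match when it accepts at least one (so exact matches are also counted as partial), and attribute-level accuracy is the per-attribute acceptance rate. The record layout, the function name `score_by_region`, and the region labels are hypothetical; the attribute names are taken from the examples in the abstract.

```python
from collections import defaultdict

# Attribute names follow the abstract's examples; the real benchmark may use more.
ATTRIBUTES = ("creator", "origin", "period")


def score_by_region(records):
    """Aggregate per-attribute judge verdicts into per-region accuracies.

    Each record is assumed to look like:
        {"region": "A", "verdicts": {"creator": True, "origin": False, "period": True}}
    where a verdict is True if the judge deemed the prediction semantically
    aligned with the reference annotation (hypothetical layout).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["region"]].append(record["verdicts"])

    report = {}
    for region, verdicts in grouped.items():
        n = len(verdicts)
        report[region] = {
            # Assumed definition: every attribute accepted.
            "exact_match": sum(all(v[a] for a in ATTRIBUTES) for v in verdicts) / n,
            # Assumed definition: at least one attribute accepted.
            "partial_match": sum(any(v[a] for a in ATTRIBUTES) for v in verdicts) / n,
            # Acceptance rate for each attribute separately.
            "attribute": {a: sum(v[a] for v in verdicts) / n for a in ATTRIBUTES},
        }
    return report


# Made-up verdicts for illustration only; these are not results from the paper.
example = [
    {"region": "A", "verdicts": {"creator": True, "origin": True, "period": True}},
    {"region": "A", "verdicts": {"creator": False, "origin": True, "period": False}},
    {"region": "B", "verdicts": {"creator": False, "origin": False, "period": False}},
]
report = score_by_region(example)
```

Under these assumed definitions, a gap between partial-match and exact-match accuracy for a region corresponds to the "fragmented signals" the abstract describes: some attributes are recovered, but rarely all of them for the same object.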
