Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture only fragmentary signals and that performance varies substantially across cultures and metadata types, yielding inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
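As an illustration of how the three reported metrics relate to one another, the following is a minimal sketch, not the benchmark's actual scoring code. It assumes the judge returns one boolean verdict per attribute for each image. The `JudgedItem` type, the `score` and `score_by_region` helpers, and the definitions used here (exact match = every attribute judged correct; partial match = at least one but not all) are hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Attributes named in the abstract; the full benchmark schema may differ.
ATTRIBUTES = ("creator", "origin", "period")


@dataclass
class JudgedItem:
    """One image with per-attribute judge verdicts (hypothetical structure)."""
    region: str
    verdicts: dict  # attribute name -> True if the judge accepted the prediction


def score(items):
    """Return exact-match, partial-match, and per-attribute accuracy for items."""
    if not items:
        raise ValueError("score() needs at least one judged item")
    n = len(items)
    exact = 0
    partial = 0
    correct_per_attr = {a: 0 for a in ATTRIBUTES}
    for item in items:
        hits = [bool(item.verdicts.get(a, False)) for a in ATTRIBUTES]
        if all(hits):
            exact += 1      # assumed definition: every attribute correct
        elif any(hits):
            partial += 1    # assumed definition: some, but not all, correct
        for attr, hit in zip(ATTRIBUTES, hits):
            correct_per_attr[attr] += int(hit)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {a: c / n for a, c in correct_per_attr.items()},
    }


def score_by_region(items):
    """Group items by cultural region and score each group separately."""
    groups = defaultdict(list)
    for item in items:
        groups[item.region].append(item)
    return {region: score(group) for region, group in groups.items()}
```

Under these assumed definitions, a model can reach high attribute-level accuracy on one field (say, origin) while its exact-match rate stays low, which is the kind of fragmented behavior the abstract describes.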
