Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

Abstract

Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimension. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in contemporary Chinese society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation. The former examines temporal-cultural feature recognition from single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs by inheriting cultural elements and adapting them to a modern context. Our evaluation shows that closed VLMs perform comparably to non-experts on cultural visual understanding but fall 10% short of human experts, while open VLMs lag further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.
