Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

October 27, 2025
Authors: Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab
cs.AI

Abstract

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
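
The gaps reported above compare average accuracy across models on different datasets. As a rough illustration only, the sketch below computes such a gap under the assumption that it is an absolute difference in mean accuracy (percentage points); the abstract does not state whether the figures are absolute or relative. The function names, model names, and scores are hypothetical placeholders, not the paper's code or results.

```python
from statistics import mean


def mean_accuracy(per_model_accuracy):
    """Average accuracy (in %) across all models for one dataset."""
    return mean(per_model_accuracy.values())


def accuracy_gap(acc_a, acc_b):
    """Gap in mean accuracy (percentage points): dataset A minus dataset B.

    Assumption: the gap is an absolute difference, not a relative one.
    """
    return mean_accuracy(acc_a) - mean_accuracy(acc_b)


# Hypothetical placeholder scores, NOT results from the paper.
english_proverbs = {"model_a": 90.0, "model_b": 80.0}
arabic_proverbs = {"model_a": 84.0, "model_b": 78.0}

gap = accuracy_gap(english_proverbs, arabic_proverbs)
print(f"Arabic proverbs trail English proverbs by {gap:.2f} points")
```

Averaging per dataset first and then subtracting keeps the comparison at the dataset level, which matches how the abstract phrases its hierarchy (English proverbs, then Arabic proverbs, then Egyptian idioms).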