Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Abstract

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to the understanding task, though providing contextual sentences that contain the idiom improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators, even on idioms with 100% inter-annotator agreement. These findings indicate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
