Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
April 17, 2026
Authors: Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang
cs.AI
Abstract
We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MWE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Using SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as on sequential compositions of these tasks. The results reveal substantial performance variation, particularly on tasks that require semantic reasoning, highlighting differences in the models' reasoning efficacy and semantic understanding, and offering insights toward building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
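To make the task taxonomy concrete, the sketch below models a benchmark item with one of the three fine-grained MWE categories and one of the three task types, plus a toy exact-match scorer. This is a minimal illustration only: the field names, category labels, and scoring rule are assumptions for exposition, not the actual SemanticQA data schema or evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class MWEItem:
    """Hypothetical benchmark item; field names are illustrative,
    not taken from the actual SemanticQA release."""
    sentence: str
    category: str  # e.g. "idiom", "noun_compound", "verbal_construction"
    task: str      # e.g. "extraction", "classification", "interpretation"
    answer: str    # gold answer for this task

def exact_match(prediction: str, item: MWEItem) -> bool:
    """Toy scorer: case-insensitive exact match against the gold answer."""
    return prediction.strip().lower() == item.answer.strip().lower()

# An extraction-task example: the model must locate the idiom in context.
item = MWEItem(
    sentence="He kicked the bucket last year.",
    category="idiom",
    task="extraction",
    answer="kicked the bucket",
)
print(exact_match("Kicked the bucket ", item))  # True
```

A sequential task composition, as evaluated in the paper, would chain such steps, e.g. first extracting the phrase and then feeding the extracted span into a classification or interpretation query.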