Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
April 17, 2026
Authors: Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang
cs.AI
Abstract
We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers general lexical phenomena, such as lexical collocations, as well as three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Using SemanticQA, we evaluate LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, and on sequential compositions of these tasks. We find substantial performance variation across models, particularly on tasks that require semantic reasoning. These results highlight differences in the reasoning efficacy and semantic understanding of LMs, and they offer insights for developing LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.