Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

Abstract

We present SemanticQA, an evaluation suite for assessing language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers general lexical phenomena, such as lexical collocations, as well as three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Using SemanticQA, we evaluate LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, and on sequential compositions of these tasks. We find substantial performance variation, particularly on tasks requiring semantic reasoning, which highlights differences in the reasoning efficacy and semantic understanding of LMs and offers insights for developing LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.

