Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Abstract

We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth": utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning that requires contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each example required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to resolve disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
