Drivelology: Challenging LLMs with Interpreting Nonsense with Depth
September 4, 2025
Authors: Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin
cs.AI
Abstract
We introduce Drivelology, a unique linguistic phenomenon characterised as
"nonsense with depth", utterances that are syntactically coherent yet
pragmatically paradoxical, emotionally loaded, or rhetorically subversive.
While such expressions may resemble surface-level nonsense, they encode
implicit meaning requiring contextual inference, moral reasoning, or emotional
interpretation. We find that current large language models (LLMs), despite
excelling at many natural language processing (NLP) tasks, consistently fail to
grasp the layered semantics of Drivelological text. To investigate this, we
construct a small but diverse benchmark dataset of over 1,200 meticulously
curated examples, with select instances in English, Mandarin, Spanish, French,
Japanese, and Korean. Annotation was especially challenging: each of the
examples required careful expert review to verify that it truly reflected
Drivelological characteristics. The process involved multiple rounds of
discussion and adjudication to address disagreements, highlighting the subtle
and subjective nature of Drivelology. We evaluate a range of LLMs on
classification, generation, and reasoning tasks. Our results reveal clear
limitations of LLMs: models often confuse Drivelology with shallow nonsense,
produce incoherent justifications, or miss the implied rhetorical function
altogether. These findings highlight a deeper representational gap in LLMs'
pragmatic understanding and challenge the assumption that statistical fluency
implies cognitive comprehension. We release our dataset and code to facilitate
further research in modelling linguistic depth beyond surface-level coherence.