Language Models Model Language

Abstract

Linguistic commentary on large language models (LLMs), heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics question whether LLMs can legitimately model language, citing the need for "deep structure" or "grounding" to achieve an idealized linguistic "competence." We argue for a radical shift in perspective towards the empiricist principles of Witold Mańczak, a prominent general and historical linguist. He defines language not as a "system of signs" or a "computational system of the brain" but as the totality of all that is said and written. Above all, he identifies the frequency of use of particular language elements as the primary governing principle of language. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.
