Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses
May 18, 2023
Authors: Eliza Kosoy, Emily Rose Reagan, Leslie Lai, Alison Gopnik, Danielle Krettek Cobb
cs.AI
Abstract
Developmental psychologists have spent decades devising experiments to test
the intelligence and knowledge of infants and children, tracing the origin of
crucial concepts and capacities. Moreover, experimental techniques in
developmental psychology have been carefully designed to discriminate the
cognitive capacities that underlie particular behaviors. We propose that using
classical experiments from child development is a particularly effective way to
probe the computational abilities of AI models, in general, and LLMs in
particular. First, the methodological techniques of developmental psychology,
such as the use of novel stimuli to control for past experience or control
conditions to determine whether children are using simple associations, can be
equally helpful for assessing the capacities of LLMs. In parallel, testing LLMs
in this way can tell us whether the information that is encoded in text is
sufficient to enable particular responses, or whether those responses depend on
other kinds of information, such as information from exploration of the
physical world. In this work we adapt classical developmental experiments to
evaluate the capabilities of LaMDA, a large language model from Google. We
propose a novel LLM Response Score (LRS) metric which can be used to evaluate
other language models, such as GPT. We find that LaMDA generates appropriate
responses that are similar to those of children in experiments involving social
understanding, perhaps providing evidence that knowledge of these domains is
discovered through language. On the other hand, LaMDA's responses in early
object and action understanding, theory of mind, and especially causal
reasoning tasks are very different from those of young children, perhaps
showing that these domains require more real-world, self-initiated exploration
and cannot simply be learned from patterns in language input.