Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses

May 18, 2023
Authors: Eliza Kosoy, Emily Rose Reagan, Leslie Lai, Alison Gopnik, Danielle Krettek Cobb
cs.AI

Abstract

Developmental psychologists have spent decades devising experiments to test the intelligence and knowledge of infants and children, tracing the origin of crucial concepts and capacities. Moreover, experimental techniques in developmental psychology have been carefully designed to discriminate the cognitive capacities that underlie particular behaviors. We propose that using classical experiments from child development is a particularly effective way to probe the computational abilities of AI models, in general, and LLMs in particular. First, the methodological techniques of developmental psychology, such as the use of novel stimuli to control for past experience or control conditions to determine whether children are using simple associations, can be equally helpful for assessing the capacities of LLMs. In parallel, testing LLMs in this way can tell us whether the information that is encoded in text is sufficient to enable particular responses, or whether those responses depend on other kinds of information, such as information from exploration of the physical world. In this work we adapt classical developmental experiments to evaluate the capabilities of LaMDA, a large language model from Google. We propose a novel LLM Response Score (LRS) metric which can be used to evaluate other language models, such as GPT. We find that LaMDA generates appropriate responses that are similar to those of children in experiments involving social understanding, perhaps providing evidence that knowledge of these domains is discovered through language. On the other hand, LaMDA's responses in early object and action understanding, theory of mind, and especially causal reasoning tasks are very different from those of young children, perhaps showing that these domains require more real-world, self-initiated exploration and cannot simply be learned from patterns in language input.