Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses

Abstract

Developmental psychologists have spent decades devising experiments to test the intelligence and knowledge of infants and children, tracing the origin of crucial concepts and capacities. Moreover, experimental techniques in developmental psychology have been carefully designed to discriminate the cognitive capacities that underlie particular behaviors. We propose that using classical experiments from child development is a particularly effective way to probe the computational abilities of AI models in general and large language models (LLMs) in particular. First, the methodological techniques of developmental psychology, such as the use of novel stimuli to control for past experience or control conditions to determine whether children are using simple associations, can be equally helpful for assessing the capacities of LLMs. In parallel, testing LLMs in this way can tell us whether the information that is encoded in text is sufficient to enable particular responses, or whether those responses depend on other kinds of information, such as information from exploration of the physical world. In this work we adapt classical developmental experiments to evaluate the capabilities of LaMDA, a large language model from Google. We propose a novel LLM Response Score (LRS) metric which can be used to evaluate other language models, such as GPT. We find that LaMDA generates appropriate responses that are similar to those of children in experiments involving social understanding, perhaps providing evidence that knowledge of these domains is discovered through language. On the other hand, LaMDA's responses in early object and action understanding, theory of mind, and especially causal reasoning tasks are very different from those of young children, perhaps showing that these domains require more real-world, self-initiated exploration and cannot simply be learned from patterns in language input.
