Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses

Abstract

Developmental psychologists have spent decades devising experiments to test the intelligence and knowledge of infants and children, tracing the origins of crucial concepts and capacities. Moreover, experimental techniques in developmental psychology have been carefully designed to discriminate the cognitive capacities that underlie particular behaviors. We propose that using classical experiments from child development is a particularly effective way to probe the computational abilities of AI models in general, and large language models (LLMs) in particular. First, the methodological techniques of developmental psychology, such as using novel stimuli to control for past experience or using control conditions to determine whether children rely on simple associations, can be equally helpful for assessing the capacities of LLMs. Second, testing LLMs in this way can tell us whether the information encoded in text is sufficient to enable particular responses, or whether those responses depend on other kinds of information, such as information gained by exploring the physical world. In this work we adapt classical developmental experiments to evaluate the capabilities of LaMDA, a large language model from Google. We propose a novel LLM Response Score (LRS) metric that can be used to evaluate other language models, such as GPT. We find that LaMDA generates appropriate responses that are similar to those of children in experiments involving social understanding, perhaps providing evidence that knowledge of these domains is discovered through language. On the other hand, LaMDA's responses in early object and action understanding, theory of mind, and especially causal reasoning tasks are very different from those of young children, perhaps showing that these domains require more real-world, self-initiated exploration and cannot simply be learned from patterns in language input.
