Evaluating Cognitive Age Alignment in Interactive AI Agents

Abstract

While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.