Evaluating Cognitive Age Alignment in Interactive AI Agents

Abstract

While agentic AI and the multimodal large language models (MLLMs) at its core have shown remarkable promise in language and visual reasoning, in domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Even when equipped with powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can solve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of a range of MLLM-based interactive agents against age-specific stages of human development, revealing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.