Evaluating Cognitive Age Alignment in Interactive AI Agents

Abstract

Agentic AI systems and the multimodal large language models (MLLMs) at their core have shown remarkable promise in language and visual reasoning, in domains ranging from daily life to advanced scientific research, yet a profound gap remains between artificial and human intelligence. Even when equipped with powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can solve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.