Evaluating Cognitive Age Alignment in Interactive AI Agents

Abstract

Agentic AI and the multimodal large language models (MLLMs) at its core have shown remarkable promise in language and visual reasoning, in domains ranging from daily life to advanced scientific research. Yet a profound gap remains between artificial and human intelligence: despite integrating powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can solve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.