Do Great Minds Think Alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

Abstract

Recent advances in large language models (LLMs) have led to claims that AI surpasses humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of the problem-solving abilities of question-answering (QA) agents, both humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that not only challenge higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI development that better emulates or complements human cognitive abilities in real-world problem-solving.
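
For readers unfamiliar with item response theory, the following is a minimal Python sketch of the general idea, not CAIMIRA's own formulation, which the abstract does not spell out. It shows the standard two-parameter logistic (2PL) IRT model, plus one possible multidimensional variant in which an agent has a skill per latent dimension. All function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct_2pl(skill: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Standard 2PL IRT: probability that an agent answers an item correctly."""
    return sigmoid(discrimination * (skill - difficulty))


def p_correct_multidim(
    skills: Sequence[float],
    difficulties: Sequence[float],
    relevance: Sequence[float],
) -> float:
    """Illustrative multidimensional variant (an assumption, not the paper's definition).

    Each latent dimension contributes (skill - difficulty), weighted by how
    relevant that dimension is to the question.
    """
    if not (len(skills) == len(difficulties) == len(relevance)):
        raise ValueError("skills, difficulties, and relevance must have equal length")
    logit = sum(r * (s - d) for s, d, r in zip(skills, difficulties, relevance))
    return sigmoid(logit)


# Hypothetical agents with skills on two made-up dimensions:
# index 0 = "abductive reasoning", index 1 = "fact retrieval".
human = [1.5, 0.2]
llm = [0.3, 1.8]

# A hypothetical question that depends almost entirely on fact retrieval.
question_difficulty = [0.5, 0.5]
question_relevance = [0.1, 0.9]

p_human = p_correct_multidim(human, question_difficulty, question_relevance)
p_llm = p_correct_multidim(llm, question_difficulty, question_relevance)
```

Under these made-up numbers, the agent that is stronger on the relevant dimension gets the higher predicted probability, which is the kind of per-dimension comparison an IRT-style analysis makes possible.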