
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

October 9, 2024
作者: Maharshi Gor, Hal Daumé III, Tianyi Zhou, Jordan Boyd-Graber
cs.AI

Abstract

Recent advancements of large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
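The abstract's core mechanism is item response theory: each response is modeled as a function of agent skill and item difficulty, which lets humans and AI systems be placed on a common scale. As a minimal illustration (not CAIMIRA's actual model, which learns multidimensional latent skill and difficulty representations), the one-parameter Rasch model predicts the probability of a correct answer from the gap between a single skill and difficulty value:

```python
import math

def p_correct(skill: float, difficulty: float) -> float:
    """Rasch (1PL) item-response model: the probability that an agent
    with latent ability `skill` answers an item of latent `difficulty`
    correctly is the logistic function of their difference."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))

# A matched agent and item yield a 50% chance of success;
# a skill advantage raises it, a harder item lowers it.
print(p_correct(skill=0.0, difficulty=0.0))   # 0.5
print(p_correct(skill=1.5, difficulty=-0.5))  # easy item: high probability
print(p_correct(skill=1.5, difficulty=3.0))   # hard item: low probability
```

Fitting such a model to the ~300,000 responses described above would recover, per agent and per question, the latent parameters whose patterns the paper analyzes; the `skill` and `difficulty` values here are hypothetical.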
