Do Great Minds Think Alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

Abstract

Recent advances in large language models (LLMs) have led to claims that AI surpasses humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these claims by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of the problem-solving abilities of question-answering (QA) agents, both humans and AI systems. By analyzing over 300,000 responses from about 70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs such as GPT-4 and LLaMA perform better on targeted information retrieval and fact-based reasoning, particularly when information gaps are well defined and can be addressed through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that not only challenge higher-order reasoning and scientific thinking but also demand nuanced linguistic interpretation and cross-contextual knowledge application, which would help advance AI systems that better emulate or complement human cognitive abilities in real-world problem-solving.