Reflection-Bench: probing AI intelligence with reflection

Abstract

Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to how intelligent systems interact with the world. From a cognitive science perspective, it serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks that span the core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
