The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

Abstract

This research introduces a novel evaluation framework that assesses the ability of large language models (LLMs) to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, both open- and closed-source, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models scored between 62% and 68% accuracy at admitting that a problem's solution is unknown, across fields ranging from biology to philosophy and mathematics. We observed a counterintuitive relationship between problem difficulty and model accuracy: GPT-4 acknowledged uncertainty more often on more challenging problems (35.8%) than on simpler ones (20.0%). This pattern indicates that models may be more prone to generate speculative answers when problems appear more tractable. The study also revealed significant variation across problem categories: models had difficulty acknowledging uncertainty on invention and NP-hard problems, while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs' ability to recognize their own knowledge boundaries, and it suggests new directions for improving model training architectures and evaluation approaches.
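
As a rough illustration of the per-difficulty rates quoted above, the following minimal sketch shows one way an uncertainty-acknowledgment rate could be tallied. It assumes each model response has already been labeled as acknowledging uncertainty or not; the `records` list, the difficulty labels, and the function name `acknowledgment_rates` are hypothetical and are not taken from the study.

```python
from collections import defaultdict

# Hypothetical toy records: (difficulty label, did the response admit the answer is unknown?).
# Illustrative values only; these are NOT data from the study.
records = [
    ("easy", False), ("easy", False), ("easy", True), ("easy", False),
    ("hard", True), ("hard", False), ("hard", True), ("hard", False),
]

def acknowledgment_rates(rows):
    """Return, per group, the fraction of responses that acknowledged uncertainty."""
    totals = defaultdict(int)
    admitted = defaultdict(int)
    for group, acknowledged in rows:
        totals[group] += 1
        admitted[group] += int(acknowledged)
    return {group: admitted[group] / totals[group] for group in totals}

rates = acknowledgment_rates(records)
print(rates)
```

Because every question in such a dataset is unsolvable by design, the acknowledgment rate doubles as the accuracy score: a response counts as correct only when it admits the answer is unknown.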
