The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Abstract

We systematically investigate a widely asked question, "Do LLMs really understand what they say?", which relates to the more familiar term "stochastic parrot". To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue by using grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates that: (1) state-of-the-art LLMs, including GPT-4o, o1, and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges LLMs because of its intrinsic difficulty rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
