The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
February 13, 2025
Authors: Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
cs.AI
Abstract
In a systematic way, we investigate a widely asked question: do LLMs really
understand what they say? This relates to the more familiar term "stochastic
parrot." To this end, we propose a summative assessment over a carefully
designed physical concept understanding task, PhysiCo. Our task alleviates the
memorization issue by using grid-format inputs that abstractly describe
physical phenomena. The grids represent varying levels of understanding, from
the core phenomenon and application examples to analogies with other abstract
patterns in the grid world. A comprehensive study on our task demonstrates
that: (1) state-of-the-art LLMs, including GPT-4o, o1, and Gemini 2.0 Flash
Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is
present in LLMs, as they fail on our grid task yet can describe and recognize
the same concepts well in natural language; (3) our task challenges LLMs
because of its intrinsic difficulty rather than the unfamiliar grid format, as
in-context learning and fine-tuning on data in the same format add little to
their performance.
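To make the idea of grid-format abstractions concrete, here is a minimal, hypothetical sketch (not taken from the paper's actual dataset): a physical concept such as gravity can be encoded as a transformation on a small 2D grid of integers, where nonzero cells fall to the bottom of their columns. The function name `apply_gravity` and the grid encoding are illustrative assumptions only.

```python
def apply_gravity(grid):
    """Return a new grid in which every nonzero cell has fallen
    to the bottom of its column, preserving top-to-bottom order.

    This is a toy abstraction of the physical concept 'gravity'
    as a grid-world pattern; it is NOT the paper's actual format.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        # Collect the nonzero cells of this column, top to bottom.
        column = [grid[r][c] for r in range(rows) if grid[r][c] != 0]
        # Stack them at the bottom of the output column.
        for i, v in enumerate(column):
            out[rows - len(column) + i][c] = v
    return out

before = [
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 0],
]
after = apply_gravity(before)
# after == [[0, 0, 0], [0, 0, 0], [1, 2, 0]]
```

A model that truly understands the concept should map the "before" grid to the "after" grid without having memorized this specific pattern.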