Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
May 23, 2024
Authors: Boshi Wang, Xiang Yue, Yu Su, Huan Sun
cs.AI
Abstract
We study whether transformers can learn to implicitly reason over parametric
knowledge, a skill that even the most capable language models struggle with.
Focusing on two representative reasoning types, composition and comparison, we
consistently find that transformers can learn implicit reasoning, but only
through grokking, i.e., extended training far beyond overfitting. The levels of
generalization also vary across reasoning types: when faced with
out-of-distribution examples, transformers fail to systematically generalize
for composition but succeed for comparison. We delve into the model's internals
throughout training, conducting analytical experiments that reveal: 1) the
mechanism behind grokking, such as the formation of the generalizing circuit
and its relation to the relative efficiency of generalizing and memorizing
circuits, and 2) the connection between systematicity and the configuration of
the generalizing circuit. Our findings guide data and training setup to better
induce implicit reasoning and suggest potential improvements to the transformer
architecture, such as encouraging cross-layer knowledge sharing. Furthermore,
we demonstrate that for a challenging reasoning task with a large search space,
GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly
regardless of prompting styles or retrieval augmentation, while a fully grokked
transformer can achieve near-perfect accuracy, showcasing the power of
parametric memory for complex reasoning.
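
To make the setup in the abstract concrete, below is a minimal, hypothetical sketch of what "extended training far beyond overfitting" (grokking) on a two-hop composition task could look like: a tiny transformer is trained on all atomic facts plus a subset of composed (inferred) facts, and training continues long after it fits the training set, while held-out composed facts are tracked. The entity counts, tokenization, model size, and hyperparameters here are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch only (not the authors' code): grokking on two-hop composition.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_ENT, D = 50, 64                              # assumed toy sizes
R1, R2, PAD = N_ENT, N_ENT + 1, N_ENT + 2      # relation / padding token ids
VOCAB = N_ENT + 3

# Atomic facts: two random relations mapping entity -> entity.
r1 = torch.randint(0, N_ENT, (N_ENT,))
r2 = torch.randint(0, N_ENT, (N_ENT,))

# Fixed-length (3-token) queries: atomic [h, r, PAD], two-hop [h, R1, R2].
ents = torch.arange(N_ENT)
atomic_x = torch.cat([
    torch.stack([ents, torch.full_like(ents, R1), torch.full_like(ents, PAD)], 1),
    torch.stack([ents, torch.full_like(ents, R2), torch.full_like(ents, PAD)], 1)])
atomic_y = torch.cat([r1, r2])
twohop_x = torch.stack([ents, torch.full_like(ents, R1), torch.full_like(ents, R2)], 1)
twohop_y = r2[r1]                              # compose: second hop after first

# Train on all atomic facts + 80% of two-hop facts; hold out the remaining two-hop facts.
perm = torch.randperm(N_ENT)
tr_id, te_id = perm[:40], perm[40:]
train_x = torch.cat([atomic_x, twohop_x[tr_id]])
train_y = torch.cat([atomic_y, twohop_y[tr_id]])

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.pos = nn.Parameter(torch.zeros(3, D))      # learned positional embeddings
        layer = nn.TransformerEncoderLayer(D, nhead=4, dim_feedforward=4 * D,
                                           batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(D, N_ENT)

    def forward(self, x):
        h = self.enc(self.emb(x) + self.pos)
        return self.out(h[:, -1])                       # predict the tail entity

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(200_000):          # keep training long after train accuracy hits 100%
    model.train()
    opt.zero_grad()
    loss_fn(model(train_x), train_y).backward()
    opt.step()
    if step % 10_000 == 0:
        model.eval()
        with torch.no_grad():
            tr = (model(train_x).argmax(-1) == train_y).float().mean()
            te = (model(twohop_x[te_id]).argmax(-1) == twohop_y[te_id]).float().mean()
        print(f"step {step:6d}  train acc {tr:.2f}  held-out two-hop acc {te:.2f}")
```

In such a setup the held-out two-hop accuracy typically lags the (quickly saturated) training accuracy, and whether and when it eventually rises is the grokking behavior the paper analyzes; the real experiments use a larger synthetic knowledge graph and a GPT-2-style decoder rather than this toy encoder.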