Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
May 23, 2024
Authors: Boshi Wang, Xiang Yue, Yu Su, Huan Sun
cs.AI
Abstract
We study whether transformers can learn to implicitly reason over parametric
knowledge, a skill that even the most capable language models struggle with.
Focusing on two representative reasoning types, composition and comparison, we
consistently find that transformers can learn implicit reasoning, but only
through grokking, i.e., extended training far beyond overfitting. The levels of
generalization also vary across reasoning types: when faced with
out-of-distribution examples, transformers fail to systematically generalize
for composition but succeed for comparison. We delve into the model's internals
throughout training, conducting analytical experiments that reveal: 1) the
mechanism behind grokking, such as the formation of the generalizing circuit
and its relation to the relative efficiency of generalizing and memorizing
circuits, and 2) the connection between systematicity and the configuration of
the generalizing circuit. Our findings guide data and training setup to better
induce implicit reasoning and suggest potential improvements to the transformer
architecture, such as encouraging cross-layer knowledge sharing. Furthermore,
we demonstrate that for a challenging reasoning task with a large search space,
GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly
regardless of prompting styles or retrieval augmentation, while a fully grokked
transformer can achieve near-perfect accuracy, showcasing the power of
parametric memory for complex reasoning.
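To make the abstract's notion of "composition" and of tracking in-distribution versus held-out generalization during extended training concrete, here is a minimal sketch of a synthetic two-hop composition setup. It is illustrative only and not taken from the paper; the entity and relation counts, the fact format, and the 95/5 split scheme are assumptions.

```python
# Hypothetical sketch (not from the paper): a synthetic two-hop composition
# dataset, so that train accuracy and held-out accuracy can be tracked
# separately during extended ("grokking") training.
import random

random.seed(0)
NUM_ENTITIES, NUM_RELATIONS = 200, 20
entities = list(range(NUM_ENTITIES))
relations = list(range(NUM_RELATIONS))

# Atomic facts: each (head, relation) pair maps to exactly one tail entity.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# Two-hop compositions: (h, r1, r2) -> tail of (atomic[(h, r1)], r2).
composed = [((h, r1, r2), atomic[(atomic[(h, r1)], r2)])
            for h in entities for r1 in relations for r2 in relations]

random.shuffle(composed)
split = int(0.95 * len(composed))
train_compositions = composed[:split]    # compositions seen during training
heldout_compositions = composed[split:]  # used only to detect generalization

# A training run would feed all atomic facts plus train_compositions, train far
# past the point where training accuracy saturates, and monitor accuracy on
# heldout_compositions, which (per the abstract) rises only much later.
print(len(train_compositions), len(heldout_compositions))
```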