Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
April 29, 2025
Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci
cs.AI
Abstract
Transformers have achieved great success in numerous NLP tasks but continue
to exhibit notable gaps in multi-step factual reasoning, especially when
real-world knowledge is sparse. Recent advances in grokking have demonstrated
that neural networks can transition from memorizing to perfectly generalizing
once they detect underlying logical patterns - yet these studies have primarily
used small, synthetic tasks. In this paper, for the first time, we extend
grokking to real-world factual data and address the challenge of dataset
sparsity by augmenting existing knowledge graphs with carefully designed
synthetic data to raise the ratio phi_r of inferred facts to atomic facts
above the threshold required for grokking. Surprisingly, we find that even
factually incorrect synthetic data can strengthen emergent reasoning circuits
rather than degrade accuracy, as it forces the model to rely on relational
structure rather than memorization. When evaluated on multi-hop reasoning
benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA -
substantially improving over strong baselines and matching or exceeding current
state-of-the-art results. We further provide an in-depth analysis of how
increasing phi_r drives the formation of generalizing circuits inside
Transformers. Our findings suggest that grokking-based data augmentation can
unlock implicit multi-hop reasoning capabilities, opening the door to more
robust and interpretable factual reasoning in large-scale language models.
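The core augmentation idea described above can be sketched in a toy form. This is an illustrative sketch under assumed conventions, not the paper's implementation: the triple format, the two-hop composition rule, the exact definition of the ratio `phi_r` (inferred facts / atomic facts), and the random synthetic-fact generator are all assumptions made for the example.

```python
import random

def infer_two_hop(atomic_facts):
    """Derive inferred (two-hop) facts by composing atomic triples:
    (a, r1, b) and (b, r2, c) yield (a, "r1/r2", c)."""
    by_head = {}
    for h, r, t in atomic_facts:
        by_head.setdefault(h, []).append((r, t))
    inferred = set()
    for h, r1, t in atomic_facts:
        for r2, t2 in by_head.get(t, []):
            inferred.add((h, r1 + "/" + r2, t2))
    return inferred

def phi_r(atomic_facts, inferred_facts):
    """Ratio of inferred facts to atomic facts (assumed definition)."""
    return len(inferred_facts) / len(atomic_facts)

def augment(atomic_facts, target_phi, seed=0):
    """Add random synthetic atomic triples (possibly factually wrong)
    over existing entities and relations until phi_r reaches target_phi.
    Factual correctness is not checked: the point is relational structure."""
    rng = random.Random(seed)
    facts = set(atomic_facts)
    relations = sorted({r for _, r, _ in facts})
    entities = sorted({e for h, _, t in facts for e in (h, t)})
    while phi_r(facts, infer_two_hop(facts)) < target_phi:
        h, t = rng.sample(entities, 2)
        facts.add((h, rng.choice(relations), t))
    return facts
```

In this toy, two atomic facts such as ("A", "born_in", "B") and ("B", "capital_of", "C") yield one inferred fact ("A", "born_in/capital_of", "C"), giving phi_r = 0.5; `augment` then injects synthetic triples until the ratio clears the chosen threshold.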