Language Models are Injective and Hence Invertible
October 17, 2025
Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'
cs.AI
Abstract
Transformer components such as non-linear activations and normalization are
inherently non-injective, suggesting that different inputs could map to the
same output and prevent exact recovery of the input from a model's
representations. In this paper, we challenge this view. First, we prove
mathematically that transformer language models mapping discrete input
sequences to their corresponding sequence of continuous representations are
injective and therefore lossless, a property established at initialization and
preserved during training. Second, we confirm this result empirically through
billions of collision tests on six state-of-the-art language models, and
observe no collisions. Third, we operationalize injectivity: we introduce
SipIt, the first algorithm that provably and efficiently reconstructs the exact
input text from hidden activations, establishing linear-time guarantees and
demonstrating exact invertibility in practice. Overall, our work establishes
injectivity as a fundamental and exploitable property of language models, with
direct implications for transparency, interpretability, and safe deployment.
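
The two empirical ideas in the abstract, collision testing and token-by-token recovery from hidden activations, can be illustrated on a toy model. The following is a minimal sketch under stated assumptions, not the authors' SipIt algorithm or their experimental setup: the single causal attention layer, its random weights, the sizes `VOCAB`, `DIM`, `MAX_LEN`, and the helper names `hidden_states`, `min_pairwise_distance`, and `invert` are all hypothetical. The recovery step here is a plain brute-force search over the vocabulary at each position, which relies only on causality (the state at position t depends on tokens up to t).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, MAX_LEN = 50, 16, 8  # hypothetical toy sizes

# Random parameters standing in for a model at initialization (assumption).
E = rng.normal(size=(VOCAB, DIM))    # token embeddings
P = rng.normal(size=(MAX_LEN, DIM))  # positional embeddings
Wq, Wk, Wv = (rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(3))


def hidden_states(tokens):
    """One causal self-attention layer with a residual connection.

    Returns an array of shape (len(tokens), DIM); row t depends only on
    tokens[0..t].
    """
    tokens = np.asarray(tokens)
    n = len(tokens)
    x = E[tokens] + P[:n]
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(DIM)
    # Causal mask: position t may not attend to positions after t.
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + 0.1 * np.tanh(weights @ (x @ Wv))


def min_pairwise_distance(num_seqs=200, length=6):
    """Collision check: smallest L2 distance between last-token states
    of distinct random input sequences."""
    seqs = np.unique(rng.integers(0, VOCAB, size=(num_seqs, length)), axis=0)
    reps = np.stack([hidden_states(s)[-1] for s in seqs])
    dists = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each sequence's distance to itself
    return dists.min()


def invert(targets):
    """Recover tokens left to right: at each position, try every vocabulary
    entry after the already-recovered prefix and keep the best match."""
    recovered = []
    for target in targets:
        candidates = [hidden_states(recovered + [v])[-1] for v in range(VOCAB)]
        errors = np.linalg.norm(np.stack(candidates) - target, axis=1)
        recovered.append(int(errors.argmin()))
    return recovered


tokens = [int(v) for v in rng.integers(0, VOCAB, size=MAX_LEN)]
recovered = invert(hidden_states(tokens))
min_dist = min_pairwise_distance()
print("recovered input exactly:", recovered == tokens)
print("minimum pairwise distance:", min_dist)
```

In this sketch the search costs one forward pass per vocabulary entry per position, so the number of positions visited grows linearly with sequence length. How SipIt selects candidates and obtains its guarantees is described in the paper itself and is not reproduced here.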