

Language Models are Injective and Hence Invertible

October 17, 2025
Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà
cs.AI

Abstract

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
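The core idea behind exact inversion can be illustrated in miniature. The sketch below is a hypothetical toy, not the SipIt implementation: the `hidden_state` function is a stand-in for a transformer's prefix-to-activation map, which the paper proves is injective. Given injectivity, each position admits exactly one vocabulary token whose extended prefix reproduces the observed activation, so greedy recovery runs in time linear in sequence length (times a vocabulary scan).

```python
# Toy sketch of exact input recovery from hidden activations, assuming the
# prefix -> hidden-state map is injective (as the paper proves for real
# transformer LMs). `hidden_state` is a hypothetical stand-in, not a model.

VOCAB = list(range(100))  # toy vocabulary of token ids


def hidden_state(prefix):
    """Deterministic stand-in for a model's hidden state at the last position."""
    h = 0
    for t in prefix:
        # rolling map; distinct last tokens always yield distinct states here
        h = (h * 1_000_003 + t + 1) % (2**61 - 1)
    return h


def invert(target_states):
    """Greedily recover the token sequence from per-position hidden states.

    Injectivity guarantees that at each step exactly one vocabulary token
    extends the recovered prefix to match the observed state.
    """
    recovered = []
    for target in target_states:
        for tok in VOCAB:
            if hidden_state(recovered + [tok]) == target:
                recovered.append(tok)
                break
        else:
            raise ValueError("no matching token: map not injective here")
    return recovered


tokens = [5, 42, 7, 99]
states = [hidden_state(tokens[: i + 1]) for i in range(len(tokens))]
print(invert(states))  # recovers [5, 42, 7, 99] exactly
```

With a real model, the inner comparison would match cached hidden activations (e.g. from a forward pass with hidden-state outputs enabled) instead of this toy map; the linear-time structure of the search is the point of the sketch.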