Language Models are Injective and Hence Invertible

Abstract

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
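To make the two empirical ideas in the abstract concrete, the following is a minimal sketch on a toy model, not the paper's experimental setup and not the SipIt algorithm itself. It assumes a small randomly initialized causal map (a tanh recurrence standing in for a transformer), and every name in it (VOCAB_SIZE, DIM, step, hidden_states, min_pairwise_distance, invert) is hypothetical. It shows (a) a collision test, which checks that distinct input sequences yield distinct final hidden states, and (b) a brute-force left-to-right inversion that recovers the tokens from per-position hidden states in time linear in the sequence length.

```python
import itertools
import math
import random

VOCAB_SIZE = 6   # hypothetical toy vocabulary size
DIM = 8          # hypothetical hidden width

rng = random.Random(0)
# Randomly initialized parameters of a toy causal map (an assumption for
# illustration; this is NOT a real transformer).
EMBED = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
W = [[rng.gauss(0.0, 0.5 / math.sqrt(DIM)) for _ in range(DIM)] for _ in range(DIM)]


def step(prev_hidden, token):
    """Hidden state after reading `token`, given the previous hidden state."""
    return [
        math.tanh(sum(W[i][j] * prev_hidden[j] for j in range(DIM)) + EMBED[token][i])
        for i in range(DIM)
    ]


def hidden_states(tokens):
    """Return the hidden state at every position of the sequence."""
    states, hidden = [], [0.0] * DIM
    for token in tokens:
        hidden = step(hidden, token)
        states.append(hidden)
    return states


def distance(a, b):
    """Euclidean distance between two hidden states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_pairwise_distance(length):
    """Collision test: smallest distance between the final hidden states
    of all distinct sequences of the given length."""
    finals = [
        hidden_states(seq)[-1]
        for seq in itertools.product(range(VOCAB_SIZE), repeat=length)
    ]
    return min(distance(a, b) for a, b in itertools.combinations(finals, 2))


def invert(states):
    """Recover tokens left to right: at each position, try every vocabulary
    item and keep the one whose hidden state best matches the observed one."""
    tokens, hidden = [], [0.0] * DIM
    for observed in states:
        best = min(range(VOCAB_SIZE), key=lambda v: distance(step(hidden, v), observed))
        tokens.append(best)
        hidden = step(hidden, best)
    return tokens


if __name__ == "__main__":
    print("min pairwise distance:", min_pairwise_distance(3))
    secret = [3, 1, 4, 1, 5]
    print("recovered:", invert(hidden_states(secret)))
```

A strictly positive minimum distance means no two of the enumerated sequences collide in this toy model. The inversion loop does one pass over the positions and tries each vocabulary item once per position, so its cost grows linearly with sequence length; it relies on the causal structure, where the state at position t depends only on the tokens up to t.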
