PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
February 19, 2025
Author: Burc Gokden
cs.AI
Abstract
We show that the Large Language Model from Power Law Decoder Representations
(PLDR-LLM) is a foundational model whose deductive outputs are invariant
tensors up to a small perturbation. PLDR-LLM learns a singularity condition for
the deductive outputs that enables the once-inferred energy-curvature tensor
G_{LM} to replace the deep neural network of power law graph
attention (PLGA) that generates the deductive outputs at inference. We demonstrate
that a cache for G_{LM} (G-cache) and a KV-cache can be implemented in
a straightforward manner to improve inference time. The invariance and
generalizability of the deductive outputs hold at very high fidelity: after
caching, the deductive outputs have the same RMSE and determinant values up to
15 decimal places, and zero-shot benchmark scores remain unchanged. Ablation
studies show that learned deductive outputs have distinct loss and accuracy
characteristics from models pretrained with transferred, randomly initialized,
or identity tensors as a constant tensor operator, and that an LLM with scaled
dot-product attention (SDPA) is a special case of PLDR-LLM where G_{LM}
is predefined as the identity tensor. The observed invariance characteristic
introduces a novel asymmetry between the training and inference phases with
caching. We outline the observed common characteristics of the deductive
outputs under the learned singularity condition. We provide an implementation
of a training and inference framework for PLDR-LLM with KV-cache and G-cache.
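The abstract states that SDPA is the special case of PLDR-LLM in which G_{LM} is predefined as the identity tensor. Below is a minimal sketch of that reduction, assuming for illustration that G_{LM} enters attention as a learned bilinear form between queries and keys; the function name `attention_with_operator` and the exact placement of G are assumptions for this sketch, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def attention_with_operator(q, k, v, G):
    # q, k, v: (batch, heads, seq, d); G: (heads, d, d) learned tensor operator.
    # Assumed placement: G acts as a bilinear form, scores = q G k^T / sqrt(d).
    d = q.size(-1)
    qG = torch.einsum("bhsd,hde->bhse", q, G)
    scores = torch.einsum("bhse,bhte->bhst", qG, k) / d**0.5
    return F.softmax(scores, dim=-1) @ v

b, h, s, d = 1, 2, 4, 8
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))

# With G_{LM} predefined as identity, the operator drops out and the result
# coincides with plain scaled dot-product attention (SDPA).
G_identity = torch.eye(d).expand(h, d, d)
out_g = attention_with_operator(q, k, v, G_identity)
out_sdpa = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(out_g, out_sdpa, atol=1e-6))  # True
```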
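A hedged sketch of how a G-cache could sit alongside a KV-cache at inference, per the abstract's claim that the once-inferred G_{LM} can replace the PLGA deep neural net: the operator is inferred a single time and reused, while keys and values accumulate per decoding step. The `infer_G` and `attend` hooks and the cache layout are hypothetical placeholders, not the released implementation.

```python
import torch

class CachedPLDRLayer:
    """Decoding-step wrapper holding both a KV-cache and a G-cache (sketch)."""

    def __init__(self, infer_G, attend):
        self.infer_G = infer_G   # expensive PLGA net producing G_{LM} (hypothetical hook)
        self.attend = attend     # attention consuming a fixed operator G (hypothetical hook)
        self.G = None            # G-cache: filled on the first step, then reused
        self.k_cache, self.v_cache = [], []  # KV-cache: grows by one entry per step

    def step(self, q, k, v):
        if self.G is None:
            # Once-inferred: run the deep neural net a single time to obtain G_{LM}.
            self.G = self.infer_G(q, k)
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = torch.cat(self.k_cache, dim=-2)
        V = torch.cat(self.v_cache, dim=-2)
        # Every later step bypasses the PLGA net and reuses the cached G.
        return self.attend(q, K, V, self.G)
```

Under the invariance claim in the abstract, outputs produced with and without such a cache could be checked by confirming that their RMSE and determinant values agree to roughly 15 decimal places and that zero-shot benchmark scores are unchanged.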