Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Each dimension encodes semantic content in its sign (+/-1) and confidence in its magnitude, acting as an independent binary register; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements, with no learned rotation. We validate this Bag of Dims framework on seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns passed through the LM head preserve 60-93% of top-5 next-token accuracy, and decoder-free Hamming scoring reaches 80-90% top-4096 accuracy. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, they trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept in all four language models, under magnitude-matched and concept-specific controls. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension sign patterns appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general rather than the language-modeling objective. The standard basis therefore suffices for reading features in a single forward pass, with no optimization and no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
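As an illustration of the sign-agreement readout described above, the following is a minimal NumPy sketch, not the authors' implementation. The function names (`sign_agreement`, `to_unit_signs`, `flip_feature`), the toy hidden state, and the choice of feature dimensions are all hypothetical; treating an exact zero as +1 in `to_unit_signs` is likewise an assumption made only to keep the output binary.

```python
import numpy as np

def sign_agreement(hidden, dims, signs):
    """Count how many of a feature's dimensions carry the expected sign.

    hidden: (d,) or (n, d) array of hidden states.
    dims:   indices of the dimensions that make up the feature (hypothetical).
    signs:  expected sign (+1 or -1) for each index in dims.
    Returns an (n,) array of agreement counts, one per hidden state.
    """
    hidden = np.atleast_2d(np.asarray(hidden, dtype=float))
    observed = np.sign(hidden[:, dims])            # (n, k), values in {-1, 0, +1}
    return (observed == np.asarray(signs)).sum(axis=1)

def to_unit_signs(hidden):
    """Discard magnitudes, keeping only a +/-1 pattern (zero mapped to +1 by assumption)."""
    return np.where(np.asarray(hidden) >= 0, 1.0, -1.0)

def flip_feature(hidden, dims):
    """Negate the feature's dimensions; every magnitude is left unchanged."""
    out = np.array(hidden, dtype=float, copy=True)
    out[..., dims] *= -1.0
    return out

# Toy example: an 8-dimensional state and a 3-dimensional feature.
h = np.array([0.9, -0.2, 0.1, -1.5, 0.3, 0.4, -0.7, 0.0])
feature_dims = [0, 3, 5]
feature_signs = np.array([+1, -1, +1])

score_before = sign_agreement(h, feature_dims, feature_signs)[0]
score_after = sign_agreement(flip_feature(h, feature_dims), feature_dims, feature_signs)[0]
print(score_before, score_after)
```

In this toy case all three feature dimensions agree before the flip and none agree after it, while the absolute values of the state are identical in both cases. A real readout would take `hidden` from a model's residual stream and would need a procedure for choosing `dims` and `signs`, which this sketch does not attempt.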