Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

June 17, 2026
Author: Varun Reddy Nalagatla
cs.AI

Abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096 accuracy. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, in a magnitude-matched and concept-specific manner. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading with one forward pass, no optimization, and no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
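To make the reading procedure concrete, here is a minimal sketch of sign-agreement scoring on a toy vector. It is an illustration under stated assumptions, not the paper's implementation. The function names (`sign_pattern`, `sign_agreement`, `hamming_distance`, `flip_feature_signs`), the representation of a feature as a mapping from dimension index to expected sign, the treatment of exact zeros as +1, and the toy values are all hypothetical. The abstract does not specify how the sign flip is applied during the forward pass, so plain negation is used here as one possible reading.

```python
from typing import Dict, List, Sequence


def sign_pattern(hidden: Sequence[float]) -> List[int]:
    """Reduce a hidden-state vector to unit-magnitude signs (+1 or -1)."""
    # Assumption: exact zeros map to +1; the abstract does not define tie handling.
    return [1 if x >= 0 else -1 for x in hidden]


def sign_agreement(hidden: Sequence[float], feature: Dict[int, int]) -> float:
    """Fraction of the feature's dimensions whose sign matches the expected sign."""
    signs = sign_pattern(hidden)
    matches = sum(1 for dim, expected in feature.items() if signs[dim] == expected)
    return matches / len(feature)


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equal-length sign patterns disagree."""
    if len(a) != len(b):
        raise ValueError("sign patterns must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def flip_feature_signs(hidden: Sequence[float], feature: Dict[int, int]) -> List[float]:
    """Copy the vector and negate the feature's dimensions, keeping magnitudes."""
    flipped = list(hidden)
    for dim in feature:
        flipped[dim] = -flipped[dim]
    return flipped


# Toy data (hypothetical): a 6-dimensional hidden state and a 3-dimension feature.
feature = {0: 1, 2: -1, 5: 1}          # dimension index -> expected sign
hidden = [0.8, -0.1, -2.3, 0.4, 0.0, 1.7]

score = sign_agreement(hidden, feature)              # all three dims agree
flipped = flip_feature_signs(hidden, feature)
score_after = sign_agreement(flipped, feature)       # no dims agree after the flip
changed = hamming_distance(sign_pattern(hidden), sign_pattern(flipped))

print(score, score_after, changed)
```

In this toy case the agreement score drops from 1.0 to 0.0 after the flip, and exactly the three feature dimensions change sign while every magnitude is unchanged. On a real model, the hidden vector would come from a chosen layer's activations, and the feature's dimension subset would have to be identified first. Neither step is shown here.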