Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096 accuracy. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension sign patterns appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading with one forward pass, no optimization, and no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
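To make the idea of reading a feature by counting sign agreements concrete, the following is a minimal NumPy sketch on synthetic data. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data generator, and the rule for choosing a feature's dimensions (keep the k dimensions whose sign is most consistent across a category's examples) are all hypothetical. It also shows the sign-flip operation in its simplest form, applied to stored vectors rather than inside a live forward pass.

```python
import numpy as np


def sign_pattern(hidden):
    """Reduce hidden states to +/-1 signs (exact zeros map to +1)."""
    return np.where(hidden >= 0, 1, -1).astype(np.int8)


def consistent_dims(member_signs, k):
    """Hypothetical selection rule: keep the k dimensions whose sign is
    most consistent across a category's examples."""
    mean_sign = member_signs.mean(axis=0)  # in [-1, 1] per dimension
    dims = np.sort(np.argsort(-np.abs(mean_sign), kind="stable")[:k])
    expected = np.where(mean_sign[dims] >= 0, 1, -1).astype(np.int8)
    return dims, expected


def agreement_score(signs, dims, expected):
    """Count, per row, how many selected dimensions show the expected sign."""
    return (signs[:, dims] == expected).sum(axis=1)


def flip_feature(hidden, dims):
    """Negate the selected dimensions; magnitudes are left unchanged."""
    flipped = hidden.copy()
    flipped[:, dims] *= -1
    return flipped


def make_toy_data(n=200, d=64, k=8, seed=0):
    """Synthetic stand-in for cached hidden states (not real model data)."""
    rng = np.random.default_rng(seed)
    pattern = rng.choice(np.array([-1, 1]), size=k)
    members = rng.standard_normal((n, d))
    members[:, :k] = np.abs(members[:, :k]) * pattern  # plant the feature
    others = rng.standard_normal((n, d))
    return members, others, pattern


if __name__ == "__main__":
    members, others, pattern = make_toy_data()
    dims, expected = consistent_dims(sign_pattern(members), k=8)
    print("selected dims:", dims)
    print("mean score, members:",
          agreement_score(sign_pattern(members), dims, expected).mean())
    print("mean score, others:",
          agreement_score(sign_pattern(others), dims, expected).mean())
    flipped = flip_feature(members, dims)
    print("mean score after flip:",
          agreement_score(sign_pattern(flipped), dims, expected).mean())
```

In this toy setup the score is simply the feature size minus the Hamming distance between a vector's signs and the expected pattern on the selected dimensions, so no weights are learned anywhere. Real hidden states would replace the output of make_toy_data, and the selection rule would need to be replaced by whatever procedure the framework actually uses.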