Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

June 17, 2026
Author: Varun Reddy Nalagatla
cs.AI

Abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096 accuracy. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension sign patterns appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading in a single forward pass, with no optimization and no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
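To make the idea of reading a feature "by counting sign agreements" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the helper names (`build_pattern`, `agreement_score`, `flip_feature`), the `min_agreement` threshold, the tie rule for zeros, and the toy four-dimensional vectors are all assumptions made for illustration. Real hidden states would come from a model's forward pass.

```python
from typing import Dict, List, Sequence


def sign(x: float) -> int:
    """Return +1 for non-negative values, -1 otherwise (assumed tie rule)."""
    return 1 if x >= 0 else -1


def build_pattern(
    examples: Sequence[Sequence[float]], min_agreement: float = 0.9
) -> Dict[int, int]:
    """Keep dimensions whose sign is consistent across the example vectors.

    Returns a mapping {dimension index: shared sign}. The threshold is a
    hypothetical knob, not a value taken from the paper.
    """
    n = len(examples)
    pattern: Dict[int, int] = {}
    for d in range(len(examples[0])):
        frac_pos = sum(1 for h in examples if sign(h[d]) == 1) / n
        if frac_pos >= min_agreement:
            pattern[d] = 1
        elif 1.0 - frac_pos >= min_agreement:
            pattern[d] = -1
    return pattern


def agreement_score(hidden: Sequence[float], pattern: Dict[int, int]) -> float:
    """Fraction of pattern dimensions whose sign in `hidden` matches."""
    if not pattern:
        return 0.0
    matches = sum(1 for d, s in pattern.items() if sign(hidden[d]) == s)
    return matches / len(pattern)


def flip_feature(hidden: Sequence[float], pattern: Dict[int, int]) -> List[float]:
    """Force pattern dimensions to the opposite sign, keeping magnitudes.

    Exact zeros have no sign to flip and are left effectively unchanged.
    """
    return [
        -pattern[d] * abs(v) if d in pattern else v
        for d, v in enumerate(hidden)
    ]


# Toy "concept" examples: dims 0 and 1 have stable signs, dims 2 and 3 do not.
examples = [
    [0.9, -1.2, 0.3, 2.0],
    [1.1, -0.4, -0.7, 1.5],
    [0.2, -2.0, 0.5, -0.1],
]
pattern = build_pattern(examples, min_agreement=1.0)
query = [0.5, -0.3, -9.0, 4.0]
flipped = flip_feature(query, pattern)
```

In this toy setup the pattern keeps only the two sign-stable dimensions, `query` agrees with the pattern on both of them, and `flipped` agrees on neither while every magnitude is preserved. A detection threshold on `agreement_score` would then play the role of the classifier; how the paper selects dimensions and thresholds is not specified in the abstract.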