Brain-Grounded Axes for Reading and Steering LLM States

December 22, 2025
Author: Sandro Andric
cs.AI

Abstract

Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via independent component analysis (ICA). We validate the axes with independent lexica and NER-based labels (POS and log-frequency serve as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a middle TinyLlama layer that survives perplexity-matched controls, and a brain-vs-text probe comparison shows that the brain axis produces larger log-frequency shifts than the text probe at lower perplexity. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with perplexity-matched (PPL-matched) text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but the effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
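
The pipeline summarized above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the standard definition of the phase-locking value, a closed-form ridge regression as a stand-in for the "lightweight adapter", and additive activation steering along the adapter's direction for one axis. The ICA step is not reproduced: synthetic axis scores stand in for the brain axes, and all function names, shapes, and hyperparameters here are hypothetical.

```python
import numpy as np


def phase_locking_value(phase_a, phase_b):
    """Standard PLV of two phase series (radians): |mean(exp(i * (a - b)))|."""
    return float(np.abs(np.mean(np.exp(1j * (phase_a - phase_b)))))


def fit_ridge_adapter(hidden, axis_scores, alpha=1.0):
    """Closed-form ridge regression from hidden states (n, d) to axis scores (n, k).

    Assumption: a linear adapter; the abstract does not specify the adapter type.
    """
    d = hidden.shape[1]
    gram = hidden.T @ hidden + alpha * np.eye(d)
    return np.linalg.solve(gram, hidden.T @ axis_scores)  # shape (d, k)


def steer(hidden, weights, axis, strength):
    """Add the unit-norm adapter direction for one axis to every hidden state."""
    direction = weights[:, axis] / np.linalg.norm(weights[:, axis])
    return hidden + strength * direction


# Synthetic stand-ins: 200 "word" hidden states of width 16, 3 "brain axes".
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
true_map = rng.normal(size=(16, 3))
axis_scores = hidden @ true_map + 0.01 * rng.normal(size=(200, 3))

W = fit_ridge_adapter(hidden, axis_scores, alpha=1e-3)

# Read the states before and after steering along axis 1.
before = hidden @ W
after = steer(hidden, W, axis=1, strength=2.0) @ W
shift = (after - before).mean(axis=0)  # mean change in each predicted axis score
```

In this toy setting the predicted score on the steered axis rises by `strength` times the norm of that axis's weight column, while the other axes move only as far as their weight columns overlap with it. In a real model the steered hidden state would be fed back through the remaining layers, and effects would be judged against perplexity-matched controls as the abstract describes.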