Brain-Grounded Axes for Reading and Steering LLM States

December 22, 2025
Author: Sandro Andric
cs.AI

Abstract

Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via independent component analysis (ICA). We validate the axes with independent lexica and labels based on named entity recognition (NER), using part-of-speech (POS) tags and log-frequency as sanity checks, then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a middle TinyLlama layer that survives perplexity-matched controls. In a brain-vs-text probe comparison, the brain axis produces larger log-frequency shifts than the text probe, at lower perplexity. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with perplexity-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but the effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
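For readers unfamiliar with the two primitives the abstract names, the following is a minimal sketch rather than the author's implementation. It assumes that instantaneous phases have already been extracted from two MEG channels (for example by band-pass filtering and a Hilbert transform, a step not shown here), and that steering means adding a scaled, unit-normalized axis to a hidden-state vector. This is a common form of activation steering, and the abstract does not confirm it is the procedure used. The function names `phase_locking_value` and `steer`, the parameter `alpha`, and all numbers are hypothetical and chosen only for illustration.

```python
import cmath
import math


def phase_locking_value(phases_a, phases_b):
    """PLV of two equal-length sequences of instantaneous phases (radians)."""
    if len(phases_a) != len(phases_b) or len(phases_a) == 0:
        raise ValueError("phase sequences must be non-empty and of equal length")
    # Sum the unit phasors of the phase difference; the magnitude of the
    # mean phasor is the PLV (1 = constant lag, near 0 = no consistent lag).
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)


def steer(hidden, axis, alpha):
    """Shift a hidden-state vector by alpha along the unit-normalized axis.

    Assumption: additive steering; not necessarily the paper's method.
    """
    if len(hidden) != len(axis):
        raise ValueError("hidden state and axis must have the same length")
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [h + alpha * x / norm for h, x in zip(hidden, axis)]


# Toy signal with a constant phase lag of 0.5 rad: fully phase-locked.
locked_a = [0.1 * k for k in range(200)]
locked_b = [p - 0.5 for p in locked_a]
plv_locked = phase_locking_value(locked_a, locked_b)

# Phase differences spread evenly around the circle: no locking.
spread_a = [2 * math.pi * k / 8 for k in range(8)]
spread_b = [0.0] * 8
plv_spread = phase_locking_value(spread_a, spread_b)

# Toy 3-dimensional "hidden state" pushed 5 units along the axis (0, 3, 4).
steered = steer([1.0, 2.0, 3.0], [0.0, 3.0, 4.0], alpha=5.0)
```

In this toy setup `plv_locked` comes out close to 1 and `plv_spread` close to 0, which are the two extremes of the PLV range. Normalizing the axis inside `steer` keeps `alpha` interpretable as a step size in hidden-state units, whatever the scale of the axis vector.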