Steering Conceptual Bias via Transformer Latent-Subspace Activation

June 23, 2025
Authors: Vansh Sharma, Venkat Raman
cs.AI

Abstract

This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, which perturbs the most highly activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward the CPP language: compared to the standard ACT framework, average probe classification accuracy increases by 15%, and probe classification accuracy in the early layers (0-6) improves by 61.5%. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, the method remains practical because only a subset of layers is steered, and it enables reproducible model behavior. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.
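
As a rough illustration of the pipeline the abstract describes (cluster per-prompt activation differences into a few steering directions, train a lightweight probe to pick one, then inject it into the hidden state), the following is a minimal Python sketch. It is not the authors' implementation: synthetic 4-dimensional vectors stand in for real layer activations, plain k-means and a softmax-regression probe are assumed choices, and the names `kmeans`, `train_probe`, `steer`, and the strength `alpha` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kmeans(vectors, k, iters=10):
    """Cluster vectors; returns (centroids, labels).
    Farthest-point initialisation keeps the sketch deterministic."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def train_probe(hiddens, labels, k, epochs=200, lr=0.5):
    """Softmax-regression probe (hidden state -> cluster id),
    trained with per-sample gradient steps. Assumed stand-in for a per-layer probe."""
    d = len(hiddens[0])
    W = [[0.0] * d for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        for h, y in zip(hiddens, labels):
            logits = [dot(W[c], h) + b[c] for c in range(k)]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for c in range(k):
                grad = exps[c] / total - (1.0 if c == y else 0.0)
                W[c] = [w - lr * grad * x for w, x in zip(W[c], h)]
                b[c] -= lr * grad
    return W, b

def steer(hidden, probe, directions, alpha=2.0):
    """Add alpha times the unit-length steering direction chosen by the probe."""
    W, b = probe
    choice = max(range(len(directions)), key=lambda c: dot(W[c], hidden) + b[c])
    norm = math.sqrt(dot(directions[choice], directions[choice])) or 1.0
    return [x + alpha * v / norm for x, v in zip(hidden, directions[choice])]

# Synthetic stand-ins for one layer: two prompt styles, each with its own
# activation difference (target-language activation minus baseline activation).
eps = [0.01 * i for i in range(10)]
hiddens = [[1.0, e, 0.0, 0.0] for e in eps] + [[e, 1.0, 0.0, 0.0] for e in eps]
diffs = [[0.0, 0.0, 1.0, e] for e in eps] + [[0.0, 0.0, e, 1.0] for e in eps]

directions, labels = kmeans(diffs, k=2)        # small set of steering directions
probe = train_probe(hiddens, labels, k=2)      # probe learns which direction fits
steered = steer([1.0, 0.05, 0.0, 0.0], probe, directions)
print([round(x, 3) for x in steered])
```

In a real model, the hidden states and differences would come from a chosen transformer layer, and the steered vector would be written back into the forward pass at that layer; restricting this to a subset of layers is what the abstract credits with keeping inference overhead modest.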