Steering Conceptual Bias via Transformer Latent-Subspace Activation

June 23, 2025
Authors: Vansh Sharma, Venkat Raman
cs.AI

Abstract

This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, which perturbs the most highly activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward the CPP language, raising average probe classification accuracy by 15% and probe classification accuracy in the early layers (0-6) by 61.5% compared to the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, the method remains practical because only a subset of layers is steered, and it enables reproducible model behavior. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.
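
The cluster-then-probe idea in the abstract can be illustrated with a toy sketch on synthetic vectors. This is not the authors' G-ACT implementation. It assumes k-means for the clustering step and a softmax-regression probe trained offline by gradient descent, neither of which the abstract specifies, and it does not model the online refinement or any hooks into a real transformer. All names (`kmeans`, `train_probe`, `steer`, `alpha`) are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means (assumed clustering method); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    # Final assignment against the converged centroids.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

def train_probe(h, labels, k, lr=0.1, steps=200):
    """Softmax-regression probe (assumed form) fit by gradient descent."""
    n, d = h.shape
    w = np.zeros((d, k))
    b = np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        logits = h @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                      # d(cross-entropy)/d(logits)
        w -= lr * h.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def steer(h, w, b, directions, alpha=1.0):
    """Add the probe-selected steering direction to one activation vector."""
    idx = int(np.argmax(h @ w + b))
    return h + alpha * directions[idx], idx

# Synthetic stand-ins for one layer's activations (no real model involved).
rng = np.random.default_rng(0)
d, n = 16, 200
group = rng.integers(0, 2, size=n)           # hidden "prompt style" per prompt
offsets = rng.normal(size=(2, d)) * 3.0      # base-activation offset per style
shifts = rng.normal(size=(2, d)) * 3.0       # target-minus-base shift per style
h_base = offsets[group] + rng.normal(size=(n, d))
h_target = h_base + shifts[group] + 0.1 * rng.normal(size=(n, d))

# 1) Cluster per-prompt activation differences into a few steering directions.
centroids, labels = kmeans(h_target - h_base, k=2)
directions = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

# 2) Train a lightweight probe to pick a direction from the base activation.
w, b = train_probe(h_base, labels, k=2)

# 3) At inference, inject the selected direction with strength alpha.
steered, idx = steer(h_base[0], w, b, directions, alpha=4.0)
print(idx, np.linalg.norm(steered - h_base[0]))
```

In a real setting, one such probe and set of directions would presumably exist per steered layer, with the addition applied to that layer's hidden state during generation; the choice of which layers to steer is what the abstract credits for keeping inference overhead modest.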