Steering Conceptual Bias via Transformer Latent-Subspace Activation

Abstract

This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, which perturbs the most highly activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward the CPP language, increasing average probe classification accuracy by 15% and improving probe classification accuracy in the early layers (0-6) by 61.5% compared to the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, the method remains practical because only a subset of layers is steered, and it enables reproducible model behavior. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.
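The clustering and probe-selection steps can be illustrated with a minimal sketch. It is not the authors' implementation: synthetic NumPy arrays stand in for hidden states at a single layer, plain k-means and a softmax-regression probe are assumed choices, the probe is trained offline rather than refined online, and every name (`kmeans`, `train_probe`, `steer`, `alpha`) is hypothetical.

```python
import numpy as np

def assign(x, centroids):
    """Index of the nearest centroid for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering choice)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster where it is
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, assign(x, centroids)

def train_probe(h, labels, k, lr=0.05, steps=200):
    """Softmax-regression probe mapping a hidden state to a cluster index."""
    n, d = h.shape
    w = np.zeros((d, k))
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        logits = h @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * h.T @ (p - onehot) / n  # cross-entropy gradient step
    return w

def steer(h, w, directions, alpha=1.0):
    """Add the probe-selected steering direction to one hidden state."""
    idx = int(np.argmax(h @ w))
    return h + alpha * directions[idx], idx

# Synthetic stand-ins for one layer's activations (illustrative only).
rng = np.random.default_rng(0)
n, d, k = 200, 16, 2
group = rng.integers(0, k, size=n)
base_acts = 2.0 * rng.normal(size=(k, d))[group] + rng.normal(size=(n, d))
true_dirs = 3.0 * rng.normal(size=(k, d))
target_acts = base_acts + true_dirs[group] + 0.1 * rng.normal(size=(n, d))

# Per-prompt activation differences, clustered into k steering directions.
diffs = target_acts - base_acts
centroids, labels = kmeans(diffs, k)
directions = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

# The probe learns which direction suits a given unsteered hidden state.
probe_w = train_probe(base_acts, labels, k)
steered, chosen = steer(base_acts[0], probe_w, directions, alpha=4.0)
```

In a real model, `base_acts` and `target_acts` would come from forward hooks on the chosen layers, and the `steer` step would run inside the hook during generation; the injection strength `alpha` is a free parameter in this sketch.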
