Steering Conceptual Bias via Transformer Latent-Subspace Activation

Abstract

This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, which perturbs the most highly activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward the CPP language, increasing the average probe classification accuracy by 15% and improving the probe classification accuracy in the early layers (0-6) by 61.5% compared to the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Per-layer probing introduces a modest inference overhead, but the method remains practical because only a subset of layers is steered, and it enables reproducible model behavior. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.
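The clustering-and-probe idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes toy 4-dimensional vectors standing in for one layer's activations, a plain k-means with farthest-first initialization, and a softmax probe trained by SGD. All function names (`make_toy_data`, `kmeans`, `train_probe`, `probe_predict`, `steer`) and the scaling factor `alpha` are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def make_toy_data(n=20, seed=0):
    """Synthetic stand-in: two prompt styles, each with its own activation shift."""
    rng = random.Random(seed)
    noisy = lambda v: [x + rng.gauss(0, 0.1) for x in v]
    bases, diffs = [], []
    for _ in range(n):
        bases.append(noisy([1, 0, 0, 0])); diffs.append(noisy([0, 2, 0, 0]))
        bases.append(noisy([-1, 0, 0, 0])); diffs.append(noisy([0, 0, 2, 0]))
    return bases, diffs

def kmeans(points, k, iters=10):
    """Cluster activation differences into k steering directions (centroids)."""
    centroids = [points[0]]
    while len(centroids) < k:  # farthest-first init, deterministic
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        centroids = [mean([p for p, l in zip(points, labels) if l == c] or [centroids[c]])
                     for c in range(k)]
    return centroids, labels

def scores(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bc for row, bc in zip(W, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def train_probe(xs, ys, k, lr=0.1, epochs=50):
    """Softmax probe mapping an activation to the index of a steering direction."""
    W = [[0.0] * len(xs[0]) for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = softmax(scores(W, b, x))
            for c in range(k):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                W[c] = [w - lr * g * xi for w, xi in zip(W[c], x)]
                b[c] -= lr * g
    return W, b

def probe_predict(W, b, x):
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)

def steer(h, W, b, directions, alpha=1.0):
    """Add the probe-selected steering direction to activation h."""
    d = directions[probe_predict(W, b, h)]
    return [hi + alpha * di for hi, di in zip(h, d)]

bases, diffs = make_toy_data()
directions, labels = kmeans(diffs, k=2)
W, b = train_probe(bases, labels, k=2)
steered = steer([1.0, 0.0, 0.0, 0.0], W, b, directions)
```

In a real model the vectors would come from hidden states at a chosen layer, with one probe per steered layer; the sketch only shows how a probe can pick among clustered directions before the shift is added.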
