UniSteer: Universal LLM Control via Flow Matching in Text-Guided Activation Space

Abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, which makes them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations given natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a single universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion: it partially transports a source activation toward a latent state, regenerates it under a target textual condition, and injects the result back into the frozen LLM. The same conditional model also supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.
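The flow inversion and energy-based classification steps described in the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the 2-D "activations", the labels `formal` and `casual`, the closed-form Gaussian velocity field standing in for the learned network, the choice to invert under the source condition, the Euler integrator, and the definition of reconstruction energy as a flow matching residual.

```python
import numpy as np

# Hypothetical toy setup: each text condition maps to the mean of a Gaussian
# over 2-D "activations". A real system would use a learned network instead.
COND_MEANS = {"formal": np.array([3.0, 0.0]), "casual": np.array([-3.0, 0.0])}
DATA_STD = 0.5  # assumed spread of activations around each condition mean


def velocity(x, t, cond):
    """Closed-form velocity field that transports N(0, I) at t=0 to
    N(mean[cond], DATA_STD^2 I) at t=1 along x_t = (1 - t) z + t x_1."""
    mu = COND_MEANS[cond]
    var_t = (1.0 - t) ** 2 + (t * DATA_STD) ** 2
    gain = (t * DATA_STD**2 - (1.0 - t)) / var_t
    return mu + gain * (x - t * mu)


def integrate(x, t_start, t_end, cond, n_steps=200):
    """Euler integration of dx/dt = velocity(x, t, cond).
    Runs backward in time when t_end < t_start."""
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0, cond)
    return x


def steer(x_src, src_cond, tgt_cond, strength=1.0):
    """Flow inversion: move partway toward the latent end (t = 1 - strength)
    under the source condition, then regenerate under the target condition."""
    t_mid = 1.0 - strength
    latent = integrate(x_src, 1.0, t_mid, src_cond)
    return integrate(latent, t_mid, 1.0, tgt_cond)


def energy(x, cond, n_samples=64, seed=0):
    """Assumed reconstruction energy: mean squared flow matching residual
    of x under `cond`, over a fixed time grid and seeded noise draws."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in np.linspace(0.05, 0.95, n_samples):
        z = rng.standard_normal(x.shape)
        x_t = (1.0 - t) * z + t * x
        total += np.sum((velocity(x_t, t, cond) - (x - z)) ** 2)
    return total / n_samples


def classify(x):
    """Pick the label whose condition gives the lowest energy."""
    return min(COND_MEANS, key=lambda c: energy(x, c))


x_src = np.array([3.2, -0.1])
steered = steer(x_src, "formal", "casual", strength=1.0)
print(steered, classify(x_src), classify(steered))
```

In this toy, `strength` controls how far the activation is carried toward the latent end before regeneration: `strength=0.0` leaves the activation unchanged, and larger values move it further toward the target condition. In an actual steering setup the steered vector would be written back into the residual stream of the frozen LLM, a step this sketch omits.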