UniSteer: Text-Guided Activation-Space Flow Matching for Versatile LLM Steering

Abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.
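The flow-inversion step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not taken from the paper: `toy_velocity` is a hypothetical closed-form stand-in for the learned velocity field, each textual condition is reduced to a fixed mean vector instead of a text embedding, `strength` is a hypothetical name for how far the activation is transported toward the latent state, and plain Euler steps replace whatever ODE solver the authors use.

```python
import numpy as np


def toy_velocity(x, t, cond_mean):
    """Hypothetical stand-in for a learned conditional velocity field v(x, t, c).

    The field simply points from x toward the mean associated with the
    condition. t is unused here; a trained model would depend on it.
    """
    return cond_mean - x


def steer(x, src_mean, tgt_mean, strength=0.5, n_steps=50):
    """Partially invert x under the source condition, then regenerate it
    under the target condition (Euler integration in both directions)."""
    x = np.asarray(x, dtype=float).copy()
    dt = strength / n_steps
    t = 1.0
    # Backward in time, from t = 1 to t = 1 - strength, under the source condition.
    for _ in range(n_steps):
        x = x - dt * toy_velocity(x, t, src_mean)
        t -= dt
    # Forward in time, back to t = 1, under the target condition.
    for _ in range(n_steps):
        x = x + dt * toy_velocity(x, t, tgt_mean)
        t += dt
    return x


# Toy 2-D "activations": the source concept sits near (1, 0), the target near (0, 1).
src_mean = np.array([1.0, 0.0])
tgt_mean = np.array([0.0, 1.0])
activation = np.array([0.9, 0.1])

steered = steer(activation, src_mean, tgt_mean, strength=0.8)
before = np.linalg.norm(activation - tgt_mean)
after = np.linalg.norm(steered - tgt_mean)
print(f"distance to target: before={before:.3f}, after={after:.3f}")
```

In this toy setting a larger `strength` moves the activation further toward the target condition, while `strength=0` leaves it unchanged. In the setup the abstract describes, the steered activation would then be injected back into the frozen LLM's residual stream.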