YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
January 13, 2026
Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at https://github.com/MBZUAI-Paris/YaPO.
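The sketch below illustrates the core idea described in the abstract, not the authors' implementation: a sparse code over SAE latents is learned, decoded through a frozen SAE decoder into a dense steering vector, and added to a residual-stream activation, with a reference-free, DPO-style preference loss plus an L1 term keeping the code sparse. All names, shapes, and the exact loss form are illustrative assumptions.

```python
# Minimal sketch of sparse SAE-latent steering (assumed details, not YaPO's exact code).
import torch
import torch.nn.functional as F

d_model, d_sae = 4096, 16384               # hidden size and SAE dictionary size (assumed)
sae_decoder = torch.randn(d_sae, d_model)  # stand-in for a pretrained SAE decoder W_dec

# Learnable sparse code over SAE latents -- the only trainable parameters.
z = torch.zeros(d_sae, requires_grad=True)

def steer(hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Decode the sparse code into a dense vector and add it to a residual activation."""
    v = torch.relu(z) @ sae_decoder        # (d_sae,) @ (d_sae, d_model) -> (d_model,)
    return hidden + alpha * v

def preference_loss(logp_chosen: torch.Tensor,
                    logp_rejected: torch.Tensor,
                    beta: float = 0.1,
                    l1: float = 1e-3) -> torch.Tensor:
    """Reference-free, DPO-style objective on log-probs computed under the steered model,
    plus an L1 penalty that keeps the SAE code sparse."""
    margin = beta * (logp_chosen - logp_rejected)
    return -F.logsigmoid(margin).mean() + l1 * z.abs().sum()
```

In this reading, only the sparse code is optimized while the base model and SAE stay frozen, which is what makes the steering direction cheap to learn and easy to interpret in terms of individual SAE features.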