SteeringControl: Holistic Evaluation of Alignment Steering in LLMs
September 16, 2025
Authors: Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
cs.AI
Abstract
We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives--bias, harmful generation, and hallucination--and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find that many tradeoffs remain unexplored and not yet systematically understood. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement across five popular steering methods. To enable this, we build a modular steering framework from unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B show that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that a poor combination of these three can likewise cause severe concept entanglement. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
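As background on the technique being benchmarked: representation steering modifies a model's hidden activations at inference time, most simply by adding a fixed direction (a steering vector) to a layer's residual-stream output. The sketch below is purely illustrative and is not the paper's SteeringControl framework; the toy model, layer choice, steering strength, and random placeholder vector are all assumptions made for the example.

```python
# Minimal sketch of activation-addition steering, one common representation
# steering technique. Illustrative only: the toy model, layer choice, and
# steering-vector construction are assumptions, not the paper's framework.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer layer's residual-stream update."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + torch.relu(self.linear(x))

d_model = 64
model = nn.Sequential(*[ToyBlock(d_model) for _ in range(4)])

# In practice the steering vector is derived from activations (e.g., a mean
# difference between behavior-positive and behavior-negative prompts); here a
# random unit vector serves purely as a placeholder.
steering_vector = torch.randn(d_model)
steering_vector = steering_vector / steering_vector.norm()
alpha = 4.0  # steering strength; typically tuned per method/model/behavior

def steering_hook(module, inputs, output):
    # Shift the hooked layer's output along the steering direction.
    return output + alpha * steering_vector

# Install the hook on one intermediate layer and compare forward passes.
handle = model[2].register_forward_hook(steering_hook)
x = torch.randn(1, 8, d_model)  # (batch, sequence, hidden)
steered = model(x)
handle.remove()
unsteered = model(x)
print((steered - unsteered).abs().mean())  # nonzero: activations were steered
```

Because a single added direction shifts all downstream computation, it can alter behaviors beyond the targeted one, which is the behavioral entanglement the benchmark is designed to measure.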