
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

March 3, 2026
作者: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
cs.AI

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.