

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

March 3, 2026
Authors: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
cs.AI

Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.