

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

March 3, 2026
Authors: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
cs.AI

Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.