How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods and find that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior and serves as a foundation for future research.
