How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
