When Do LLMs Become Steerable?

Abstract

Activation steering offers a lightweight approach to controlling a language model's behavior at inference time, but whether it succeeds depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states early in the generation process, e.g., after the first few tokens have been generated, and how such a predictor can be used to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Using this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how the effects of steering propagate across layers and token positions, which provides key information for steerability prediction. We then train a gradient-boosted decision tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer, without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
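As a rough illustration of the pipeline described above, the sketch below compares unsteered and steered hidden states over layers and early decoding steps, then uses a three-way predictor to guide a search over steering strength. Everything here is an assumption made for illustration: the feature set (cosine similarity and relative shift), the function names `early_step_features` and `search_strength`, the bisection strategy, and the premise that the outcome moves monotonically from under-steer to over-steer as strength grows. The abstract does not specify the actual features or search procedure.

```python
import numpy as np

def early_step_features(h_base, h_steer, eps=1e-8):
    """Compare hidden states before and after steering.

    h_base, h_steer: arrays of shape (layers, steps, hidden_dim) holding the
    unsteered and steered hidden states for the first few decoding steps.
    Returns a flat vector with two hypothetical features per (layer, step):
    cosine similarity and the shift norm relative to the unsteered norm.
    """
    base_norm = np.linalg.norm(h_base, axis=-1)
    steer_norm = np.linalg.norm(h_steer, axis=-1)
    cos = (h_base * h_steer).sum(axis=-1) / (base_norm * steer_norm + eps)
    rel_shift = np.linalg.norm(h_steer - h_base, axis=-1) / (base_norm + eps)
    return np.concatenate([cos.ravel(), rel_shift.ravel()])

def search_strength(predict, lo, hi, max_probes=8):
    """Bisection over steering strength guided by a predictor.

    predict(strength) must return "under", "success", or "over". In practice
    it would run a few decoding steps, build features, and call a trained
    classifier; here it is any callable. Assumes (hypothetically) that the
    outcome is monotone in strength.
    """
    for _ in range(max_probes):
        mid = 0.5 * (lo + hi)
        label = predict(mid)
        if label == "success":
            return mid
        if label == "under":
            lo = mid   # too weak: search stronger settings
        else:
            hi = mid   # too strong: search weaker settings
    return None        # no successful strength found within the budget
```

Each probe in this sketch costs only a few decoding steps rather than a full rollout, which is the source of the savings the abstract refers to. A trained GBDT classifier would take the place of the `predict` callable.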