Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, which obscures their connections and makes comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective: control shifts representations along target-concept directions, which enhances preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
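The abstract does not give formulas, so the following is only an illustrative sketch, not the paper's formulation. It assumes a single linear layer and shows one way an activation-based intervention and a LoRA-style update can both be read as weight updates scaled by a control strength, plus a two-way log-odds score for a polarity pair. All function names and the parameter `alpha` are hypothetical.

```python
import numpy as np

def steered_output(W, b, x, v, alpha):
    """Activation steering (assumed form): add alpha * v to the layer output."""
    return W @ x + b + alpha * v

def bias_update_output(W, b, x, v, alpha):
    """The same intervention written as a parameter update b -> b + alpha * v."""
    return W @ x + (b + alpha * v)

def lora_output(W, b, x, A, B, alpha):
    """LoRA-style update (assumed form): W -> W + alpha * (B @ A)."""
    return (W + alpha * (B @ A)) @ x + b

def log_odds(logp_pos, logp_neg):
    """Log-odds of the positive over the negative member of a polarity pair.

    With p = P(pos) / (P(pos) + P(neg)), log(p / (1 - p)) reduces to
    log P(pos) - log P(neg).
    """
    return logp_pos - logp_neg

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=3)
v = rng.normal(size=4)                                    # hypothetical steering direction
A, B = rng.normal(size=(2, 3)), rng.normal(size=(4, 2))   # rank-2 factors

# Steering the output and updating the bias give the same result.
same = np.allclose(steered_output(W, b, x, v, 0.5),
                   bias_update_output(W, b, x, v, 0.5))
```

In this sketch, `alpha` plays the role of the control strength: at `alpha = 0` every variant reduces to the unmodified layer, and larger values move the output further from it.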
