Toward Risk-Free Open-Weight Models: Separating Public and Private Capabilities in LLMs

Abstract

Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment, but they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use specialized model variants, input/output monitors, and API permissions. The former sacrifices capability for all users to mitigate the risks posed by a few and remains susceptible to jailbreaks; the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), in which a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model's behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model's weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. Overall, TLMs take a step toward reconciling open-weight release with selective capability control.
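
To make the idea of a key-induced alternative computation graph concrete, the toy sketch below reads the same weights in two ways: as stored (the public configuration) and with one layer's rows permuted (a keyed configuration). It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say which parameters are permuted or how a key is expanded into a permutation, so `permutation_from_key`, the seeded shuffle, the two-layer MLP, and the choice to permute only the rows of `W1` are all hypothetical.

```python
import random

def permutation_from_key(key: int, n: int) -> list[int]:
    """Expand a compact integer key into a permutation of range(n).
    Hypothetical scheme: the key seeds a PRNG that shuffles the indices."""
    rng = random.Random(key)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0, a) for a in v]

def forward(W1, W2, x, perm=None):
    """Two-layer MLP over one fixed set of weights.
    perm=None is the public configuration. With a permutation, the rows of W1
    are read in permuted order while W2 is left untouched, so hidden units are
    paired with different output weights and the computed function changes."""
    if perm is not None:
        W1 = [W1[p] for p in perm]
    return matvec(W2, relu(matvec(W1, x)))

# Toy weights (assumed, chosen so the arithmetic can be followed by hand).
W1 = [[1, 0], [0, 1], [1, 1]]   # for x = [1, 2] the hidden vector is [1, 2, 3]
W2 = [[1, 10, 100]]
x = [1, 2]

public_out = forward(W1, W2, x)               # 1*1 + 10*2 + 100*3 = 321
explicit_out = forward(W1, W2, x, [2, 0, 1])  # hidden [3, 1, 2]: 3 + 10 + 200 = 213

# The same mechanism, with the permutation derived from a compact key.
perm = permutation_from_key(key=2024, n=len(W1))
keyed_out = forward(W1, W2, x, perm)
print(public_out, explicit_out, keyed_out)
```

Only `W1` is permuted in this sketch because permuting the rows of `W1` and the columns of `W2` by the same permutation would merely relabel the hidden units and leave the function unchanged. In this toy the weights are fixed by hand; in the setting the abstract describes, both readings of the weights would be trained jointly so that each yields a useful model.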