Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs
June 19, 2026
Authors: Charbel El Feghali, Arkil Patel, Nicholas Meade, Spandana Gella, Verna Dankers, Siva Reddy
cs.AI
Abstract
Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment, but they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use specialized model variants, input/output monitors, and API permissions. The former is susceptible to jailbreaks and sacrifices capability for all users to mitigate the risks posed by a few; the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), in which a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model's behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model's weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. Overall, TLMs take a step toward reconciling open-weight release with selective capability control.
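To make the keyed-permutation idea concrete, the following is a minimal toy sketch in NumPy, not the authors' implementation. It assumes a two-layer ReLU network, an integer key that seeds a permutation, and a permuted subset consisting of the columns of the second weight matrix. The names `key_to_permutation` and `forward`, the layer sizes, and the key value are all hypothetical choices made for illustration; the abstract does not specify which parameters are permuted or how the key is encoded.

```python
import numpy as np

def key_to_permutation(key: int, n: int) -> np.ndarray:
    """Derive a permutation of n indices from an integer key (assumed encoding)."""
    return np.random.default_rng(key).permutation(n)

def forward(x, w1, w2, perm=None):
    """Two-layer ReLU network; `perm` optionally reorders the columns of w2.

    With perm=None this is the public configuration. With a permutation,
    the same stored weights are read in a different order, which gives a
    different computation graph without changing any weight values.
    """
    h = np.maximum(w1 @ x, 0.0)   # hidden activations
    if perm is not None:
        w2 = w2[:, perm]          # indexing returns a copy; stored weights are untouched
    return w2 @ h

# Toy weights and input (random, purely illustrative).
rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

perm = key_to_permutation(key=1234, n=8)   # hypothetical secret key
public_out = forward(x, w1, w2)            # public configuration
keyed_out = forward(x, w1, w2, perm)       # keyed configuration, same weights
```

Only the second layer is permuted here: permuting the rows of `w1` and the columns of `w2` consistently would leave the function unchanged, so a key has to reorder one side only for the two configurations to differ. In this sketch the two outputs will generally differ for a non-identity permutation, but nothing useful is learned in either configuration; the joint pretraining and regularized fine-tuning described in the abstract are what would give each configuration meaningful behavior.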