Toward Risk-Free Open-Weight Models: Separating Public and Private Capabilities in LLMs

Abstract

Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment, but they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that rely on specialized model variants, input/output monitors, and API permissions. The former is susceptible to jailbreaks and sacrifices capability for all users to mitigate the risks posed by a few; the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), in which a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch and then fine-tunes the keyed configuration on private data, with regularization to preserve the public model's behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model's weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. Overall, TLMs are a step toward reconciling open-weight release with selective capability control.
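
The idea of a key that permutes a small parameter subset can be illustrated with a toy sketch. This is not the paper's implementation: the two-layer ReLU network, the choice of which rows to permute, the key format (an explicit tuple of source indices), and the names `apply_key`, `SUBSET`, and `SECRET_KEY` are all assumptions made for illustration. The sketch only shows that the same stored weights, read in a key-dependent order, define a different computation.

```python
# Toy illustration (assumed setup, not the paper's method): a key reorders a
# small subset of rows in one weight matrix, so the same released weights
# yield a second computation graph.

def apply_key(w1, subset, key):
    """Return w1 with the rows at `subset` reordered according to `key`.

    `subset` lists the row positions the key may touch; `key[i]` is the
    source row placed at position `subset[i]`. Both are hypothetical formats.
    """
    keyed = list(w1)  # shallow copy: same row objects, new ordering
    for dst, src in zip(subset, key):
        keyed[dst] = w1[src]
    return keyed

def mlp(w1, w2, x):
    """Two-layer ReLU network over plain Python lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# "Released" weights: one set of parameters, shared by both configurations.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
W2 = [[1.0, 2.0, 3.0, 4.0]]
x = [1.0, 2.0]

SUBSET = (0, 2)      # assumed: only these rows are affected by the key
SECRET_KEY = (2, 0)  # assumed: swap rows 0 and 2

public_out = mlp(W1, W2, x)                                # default configuration
keyed_out = mlp(apply_key(W1, SUBSET, SECRET_KEY), W2, x)  # keyed configuration
print(public_out, keyed_out)
```

Note that only the first layer is permuted here. Permuting the hidden units consistently in both layers would leave the function unchanged, so a keyed configuration has to break that symmetry in some way; how the paper does so is not stated in the abstract.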