A Gravitational Interpretation of Fine-Tuning Reversion

Abstract

Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue that these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). In our main track, the cosine alignment with v_rev rises from 0.429 ± 0.052 after the first update to 0.647 ± 0.021 by step 20. Across 24 run-step pairs, every observed alignment exceeds the p99 of an isotropic activation-space null. Selectively blocking motion along v_rev changes the final alignment at T=100 from 0.648 ± 0.009 to -0.211 ± 0.021 and reduces harmfulness from 19.0% ± 4.0% to 8.5% ± 1.5% with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup. Importantly, we do not claim that v_rev is the unique safety direction, nor that the dominant manifold is directly observed; rather, we identify a robust, history-defined direction that explains and partially controls early reversion dynamics.
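The three quantities named in the abstract (cosine alignment with v_rev, an isotropic null threshold, and blocking motion along v_rev) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how v_rev is constructed, how drift is measured, how the null is sampled, or how blocking is implemented. Here the drift and v_rev are plain vectors, the null is taken to be random Gaussian directions, and blocking is taken to be an orthogonal projection. The function names are hypothetical. A pure projection leaves zero alignment by construction, so it would not by itself reproduce the negative final alignment reported above.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def project_out(update, v_rev):
    """Return `update` with its component along `v_rev` removed.

    Assumption: "blocking motion along v_rev" is modeled as an
    orthogonal projection; the source does not state the mechanism.
    """
    scale = dot(update, v_rev) / dot(v_rev, v_rev)
    return [u - scale * v for u, v in zip(update, v_rev)]


def isotropic_null_p99(dim, n_samples=2000, seed=0):
    """Empirical 99th percentile of the cosine between a fixed direction
    and random isotropic Gaussian vectors in `dim` dimensions.

    Assumption: the isotropic null is sampled this way. By rotational
    symmetry the fixed direction can be the first coordinate axis, so
    the cosine is the first coordinate divided by the vector norm.
    """
    rng = random.Random(seed)
    cosines = []
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        cosines.append(g[0] / norm(g))
    cosines.sort()
    return cosines[int(0.99 * (n_samples - 1))]


if __name__ == "__main__":
    v_rev = [1.0, 0.0, 0.0]           # hypothetical reversion direction
    drift = [0.6, 0.3, 0.1]           # hypothetical representational drift
    print("alignment:", cosine(drift, v_rev))
    blocked = project_out(drift, v_rev)
    print("component along v_rev after blocking:", dot(blocked, v_rev))
    print("null p99 (dim=64):", isotropic_null_p99(64))
```

Under these assumptions, an observed alignment counts as exceeding the null when `cosine(drift, v_rev)` is larger than `isotropic_null_p99(dim)` for the activation dimension in question.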