
SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers

January 8, 2026
Authors: Arion Das, Partha Pratim Saha, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das
cs.AI

Abstract

Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer's spectrum decays (how fast small modes vanish); higher values indicate stronger contraction into fewer effective directions. The transport score summarizes how much the token distribution shifts between adjacent layers using a bounded overlap measure; lower values indicate shorter, smoother steps through representation space. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass, while unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths. Overall, alignment is geometrically localized: the final layers encode the dominant preference-induced corrections. SPINAL turns this localization into a practical audit signal, quantifying where alignment concentrates, how strongly it manifests, and when it begins to destabilize during training.
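
To make the depth-trace construction concrete, here is a minimal Python sketch under stated assumptions: the abstract does not give the exact spectral statistic or the bounded overlap measure, so `contraction_score` below uses an illustrative log-linear fit to the singular-value tail, `transport_score` uses 1 - Bhattacharyya coefficient over logit-lens-style token distributions, and `hidden_states` / `unembed` are hypothetical inputs. This is a sketch of the idea, not the authors' implementation.

```python
# Illustrative SPINAL-style depth traces, built only from the abstract's
# verbal definitions. Both scores are proxies, not the paper's formulas.
import numpy as np

def contraction_score(h: np.ndarray, tail_frac: float = 0.5) -> float:
    """Proxy for 'how quickly the tail of a layer's spectrum decays'.

    h: (tokens, dim) hidden states for one layer. Fit a log-linear decay
    rate to the tail of the singular-value spectrum; a steeper decay means
    small modes vanish faster, so we return the slope's magnitude:
    higher = stronger contraction into fewer effective directions.
    """
    s = np.linalg.svd(h - h.mean(axis=0), compute_uv=False)
    tail = s[int(len(s) * (1 - tail_frac)):]
    tail = tail[tail > 1e-12]  # drop numerically zero modes
    if len(tail) < 2:
        return 0.0
    slope, _ = np.polyfit(np.arange(len(tail)), np.log(tail), deg=1)
    return -slope

def transport_score(h_prev: np.ndarray, h_next: np.ndarray,
                    unembed: np.ndarray) -> float:
    """Proxy for the token-distribution shift between adjacent layers.

    Reads both layers through an unembedding matrix (logit-lens style),
    then uses 1 - Bhattacharyya coefficient as a bounded overlap distance
    in [0, 1]; lower = shorter, smoother steps through representation space.
    """
    def token_dist(h):
        logits = h @ unembed                       # (tokens, vocab)
        logits -= logits.max(axis=-1, keepdims=True)
        p = np.exp(logits)
        return p / p.sum(axis=-1, keepdims=True)
    p, q = token_dist(h_prev), token_dist(h_next)
    bc = np.sqrt(p * q).sum(axis=-1)               # per-token overlap in [0, 1]
    return float(1.0 - bc.mean())

def depth_trace(hidden_states, unembed):
    """Encode a checkpoint as (layer index, contraction, transport) triples."""
    trace = []
    for i, h in enumerate(hidden_states):
        c = contraction_score(h)
        t = transport_score(hidden_states[i - 1], h, unembed) if i > 0 else 0.0
        trace.append((i, c, t))
    return trace
```

On the abstract's reading, an aligned checkpoint's trace should show the contraction entries ramping up and the transport entries smoothly shrinking over the final layers, while an unaligned checkpoint's trace stays noisier and less coherent across depth.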