
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

March 23, 2026
作者: Alexandra Zelenin, Alexandra Zhuravlyova
cs.AI

Abstract

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
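The factored norm described above follows from expanding the squared row norm of W + sBA into base, cross, and Gram terms. The following NumPy sketch illustrates the idea under stated assumptions; the function and variable names are illustrative and do not reflect the paper's Triton implementation, which additionally fuses these steps into one kernel.

```python
import numpy as np

def factored_row_norms(W, B, A, s):
    """Row-wise norms of W + s*B@A without materializing the dense [d_out, d_in] product.

    Shapes (illustrative): W [d_out, d_in], B [d_out, r], A [r, d_in], s a scalar.
    Intermediates are O(d_out*r + r^2): WAt is [d_out, r], G is [r, r].
    """
    # Base term: ||W_i||^2 per row; streams over W with no extra [d_out, d_in] buffer.
    base = np.einsum('ij,ij->i', W, W)
    # Cross term: 2s * <W_i, B_i A> = 2s * sum_k (W A^T)_{ik} * B_{ik}.
    WAt = W @ A.T                                   # [d_out, r] intermediate
    cross = 2.0 * s * np.einsum('ik,ik->i', WAt, B)
    # Gram term: s^2 * B_i (A A^T) B_i^T, with G = A A^T an [r, r] intermediate.
    G = A @ A.T
    gram = (s ** 2) * np.einsum('ir,rk,ik->i', B, G, B)
    return np.sqrt(base + cross + gram)

# Reference (what surveyed frameworks effectively do): materialize the dense product.
def dense_row_norms(W, B, A, s):
    return np.linalg.norm(W + s * (B @ A), axis=1)
```

Both functions agree up to floating-point error; the factored version's peak extra memory is dominated by the [d_out, r] and [r, r] intermediates rather than a [d_out, d_in] buffer, which is the source of the ~512 MB transient cost cited in the abstract.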