Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
March 23, 2026
Authors: Alexandra Zelenin, Alexandra Zhuravlyova
cs.AI
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved.
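The baseline cost described above can be illustrated with a minimal NumPy sketch (shapes here are illustrative; `d_out = 1024` is chosen small enough to run, not taken from the paper, which only fixes `d_in = 8192` and `r = 384`):

```python
import numpy as np

# Naive DoRA row-norm, as the surveyed frameworks implement it:
# materialize the dense [d_out, d_in] product BA, then take row norms.
d_out, d_in, r, s = 1024, 8192, 384, 1.0
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
B = rng.standard_normal((d_out, r)).astype(np.float32)
A = rng.standard_normal((r, d_in)).astype(np.float32)

BA = B @ A                                  # dense [d_out, d_in] intermediate
norms = np.linalg.norm(W + s * BA, axis=1)  # row-wise norms of W + sBA

# Transient footprint of BA alone at 2 bytes/element (bf16):
transient_bf16 = d_out * d_in * 2           # 16 MiB at this toy d_out
# At d_out = 32768 the same product is 32768 * 8192 * 2 bytes = 512 MiB,
# consistent with the abstract's per-module figure.
```

The dense intermediate scales with `d_out * d_in` regardless of rank, which is why the cost persists even at modest `r`.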
We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice.
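The factored norm follows from expanding the squared row norm: for row i, ||W_i + s B_i A||^2 = ||W_i||^2 + 2s <W_i, B_i A> + s^2 B_i (A A^T) B_i^T, where the cross term reduces to a sum over the [d_out, r] intermediate W A^T and the Gram term needs only the [r, r] matrix A A^T. A NumPy sketch of this identity (not the paper's Triton kernels; shapes illustrative) checks it against the dense reference:

```python
import numpy as np

d_out, d_in, r, s = 256, 1024, 32, 0.5
rng = np.random.default_rng(1)
W = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

# Reference: materialize the dense BA product.
ref = np.linalg.norm(W + s * (B @ A), axis=1)

# Factored: base + cross + Gram terms.
base = np.sum(W * W, axis=1)                 # ||W_i||^2
WAt = W @ A.T                                # [d_out, r] intermediate
cross = 2.0 * s * np.sum(WAt * B, axis=1)    # 2s <W_i, B_i A>
G = A @ A.T                                  # [r, r] Gram matrix
gram = s * s * np.sum((B @ G) * B, axis=1)   # s^2 B_i G B_i^T
fact = np.sqrt(base + cross + gram)

assert np.allclose(ref, fact)
```

All intermediates are of size O(d_out r) or O(r^2), so the [d_out, d_in] buffer never exists.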
Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.