Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Abstract

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm a 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
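The factored norm described above can be illustrated with a small NumPy sketch. It is not the authors' implementation, which targets bf16 and Triton. It only checks the algebraic identity, assuming W has shape [d_out, d_in], B has shape [d_out, r], and A has shape [r, d_in]. The function names `dense_row_norms` and `factored_row_norms` are hypothetical and chosen for illustration.

```python
import numpy as np


def dense_row_norms(W, A, B, s):
    """Reference path: materializes the dense [d_out, d_in] product B @ A."""
    return np.linalg.norm(W + s * (B @ A), axis=1)


def factored_row_norms(W, A, B, s):
    """Row-wise norm of W + s * B @ A without forming B @ A.

    For row i with w_i = W[i] and b_i = B[i]:
        ||w_i + s b_i A||^2 = ||w_i||^2              (base term)
                            + 2 s <w_i A^T, b_i>     (cross term)
                            + s^2 b_i (A A^T) b_i^T  (Gram term)
    The only intermediates are [d_out, r] and [r, r] arrays.
    """
    base = np.einsum("ij,ij->i", W, W)       # [d_out]
    WA = W @ A.T                             # [d_out, r]
    cross = np.einsum("ij,ij->i", WA, B)     # [d_out]
    G = A @ A.T                              # [r, r] Gram matrix of A
    gram = np.einsum("ij,ij->i", B @ G, B)   # [d_out]
    sq = base + 2.0 * s * cross + (s * s) * gram
    # Clamp tiny negative values caused by rounding before the square root.
    return np.sqrt(np.maximum(sq, 0.0))


if __name__ == "__main__":
    # Small illustrative sizes; the paper's setting is far larger.
    rng = np.random.default_rng(0)
    d_out, d_in, r, s = 64, 128, 8, 0.5
    W = rng.standard_normal((d_out, d_in))
    A = rng.standard_normal((r, d_in))
    B = rng.standard_normal((d_out, r))
    diff = np.abs(dense_row_norms(W, A, B, s) - factored_row_norms(W, A, B, s))
    print("max abs difference:", diff.max())
```

In this sketch the Gram matrix `G` has r x r entries regardless of d_in, and the largest temporary is the [d_out, r] array, which matches the O(d_out r + r^2) intermediate footprint stated in the abstract. The sketch runs in float64; behavior in bf16 is outside what it demonstrates.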
