MARBLE: Multi-Aspect Reward Balance for Diffusion RL
May 7, 2026
Authors: Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li, Chunhua Shen
cs.AI
Abstract
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x) = \sum_k w_k R_k(x)$, or fine-tuning sequentially with a hand-crafted stage schedule. These approaches either fail to produce a unified model jointly trained on all rewards or require heavy, manually tuned sequential training. We find that the failure stems from naive weighted-sum reward aggregation, which suffers from a sample-level mismatch: most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant to others, so weighted summation dilutes their supervision. To address this, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction by solving a quadratic programming (QP) problem, with no manually tuned reward weights. We further propose an amortized formulation that exploits the affine structure of the DiffusionNFT loss to reduce the per-step cost from K+1 backward passes to near that of a single-reward baseline, together with EMA smoothing of the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine, which is negative in 80% of mini-batches under weighted summation, consistently positive, and runs at 0.97× the speed of baseline training.
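
The abstract names a quadratic program for harmonizing per-reward gradients but does not spell it out. One common instantiation in multi-objective learning is the MGDA-style min-norm problem: find simplex weights $\alpha$ minimizing $\|\sum_k \alpha_k g_k\|^2$ over the per-reward gradients $g_k$. The numpy sketch below assumes that formulation; `min_norm_weights`, the Frank-Wolfe solver, and all shapes are illustrative assumptions, not MARBLE's actual QP.

```python
# Illustrative sketch only: an MGDA-style min-norm QP over the simplex,
# one common way to merge K per-reward policy gradients into a single
# update direction. MARBLE's actual QP may differ.
import numpy as np

def min_norm_weights(grads: np.ndarray, iters: int = 50) -> np.ndarray:
    """Frank-Wolfe for min_alpha ||sum_k alpha_k g_k||^2 on the simplex.

    grads: (K, D) array with one flattened policy gradient per reward.
    Returns alpha: (K,) nonnegative weights that sum to 1.
    """
    K = grads.shape[0]
    gram = grads @ grads.T            # (K, K) pairwise gradient inner products
    alpha = np.full(K, 1.0 / K)       # start from uniform weights
    for _ in range(iters):
        obj_grad = gram @ alpha       # gradient of the quadratic objective
        k = int(np.argmin(obj_grad))  # best simplex vertex (linear minimizer)
        direction = -alpha
        direction[k] += 1.0           # move toward vertex e_k
        denom = direction @ gram @ direction
        if denom <= 1e-12:            # already at the min-norm point
            break
        # Exact line search for the 1-D quadratic along `direction`.
        step = float(np.clip(-(alpha @ gram @ direction) / denom, 0.0, 1.0))
        alpha = alpha + step * direction
    return alpha

# Usage: five rewards, flattened parameter gradients of dimension D.
rng = np.random.default_rng(0)
grads = rng.normal(size=(5, 1000))
alpha = min_norm_weights(grads)
update = alpha @ grads                # single harmonized update direction
```

A min-norm point of this kind tends to damp directions where rewards conflict, which is consistent with the reported flip of the worst-aligned reward's gradient cosine from negative to positive.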
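
The K+1-to-1 backward-pass saving hinges on the stated affine structure of the DiffusionNFT loss. If each per-reward objective is a shared per-sample loss weighted by that reward's advantages, then for fixed balancing coefficients the combined objective collapses to one reweighted scalar loss, so a single backward pass suffices. A minimal PyTorch sketch under that assumption follows; `combined_loss`, `EmaCoefficients`, and all shapes are hypothetical names, not MARBLE's API.

```python
# Hedged sketch, not MARBLE's implementation. Assumption: the loss is
# affine in per-sample advantage weights, loss_k = sum_n A[k, n] * l_n(theta),
# so mixing rewards reduces to mixing weights before one backward pass.
import torch

def combined_loss(per_sample_loss: torch.Tensor,        # (N,) differentiable l_n
                  advantages: torch.Tensor,             # (K, N) per-reward advantages
                  alpha: torch.Tensor) -> torch.Tensor: # (K,) balancing coefficients
    # sum_k alpha_k sum_n A[k, n] l_n  ==  sum_n (alpha @ A)[n] l_n:
    # one scalar loss, hence one backward pass for any fixed alpha.
    weights = alpha @ advantages                        # (N,) mixed sample weights
    return (weights.detach() * per_sample_loss).sum()

class EmaCoefficients:
    """EMA smoothing of the balancing coefficients across steps,
    damping transient single-batch fluctuations."""

    def __init__(self, num_rewards: int, decay: float = 0.9):
        self.decay = decay
        self.value = torch.full((num_rewards,), 1.0 / num_rewards)

    def update(self, alpha: torch.Tensor) -> torch.Tensor:
        self.value = self.decay * self.value + (1.0 - self.decay) * alpha
        return self.value / self.value.sum()            # keep weights on the simplex
```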