
MARBLE: Multi-Aspect Reward Balance for Diffusion RL

May 7, 2026
作者: Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li, Chunhua Shen
cs.AI

Abstract

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward R(x) = ∑_k w_k R_k(x), or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model jointly trained on all rewards or require heavy, manually tuned sequential training. We find that the failure stems from naive weighted-sum reward aggregation, which suffers from a sample-level mismatch: most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others, so weighted summation dilutes their supervision. To address this, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction by solving a quadratic programming problem, without manually tuned reward weights. We further propose an amortized formulation that exploits the affine structure of the DiffusionNFT loss to reduce the per-step cost from K+1 backward passes to near the single-reward baseline, together with EMA smoothing of the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine, negative under weighted summation in 80% of mini-batches, to consistently positive, and runs at 0.97× the speed of baseline training.
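The abstract does not spell out MARBLE's quadratic program, but its gradient-harmonization step can be sketched with the classic MGDA-style min-norm QP: find convex-combination weights over the per-reward gradients that minimize the norm of the combined direction, so the result has non-negative alignment with every reward's gradient. The sketch below is an illustrative stand-in under that assumption; the function names (`min_norm_weights`, `EMABalancer`), the Frank-Wolfe solver, and the EMA decay value are hypothetical, not taken from the paper.

```python
import numpy as np

def min_norm_weights(grads, iters=100):
    """Frank-Wolfe solver for the min-norm point in the convex hull of the
    per-reward gradients (a small simplex-constrained QP); grads is (K, D)."""
    K = grads.shape[0]
    G = grads @ grads.T              # K x K Gram matrix of gradient inner products
    w = np.full(K, 1.0 / K)          # start from uniform weights on the simplex
    for _ in range(iters):
        i = int(np.argmin(G @ w))    # simplex vertex giving steepest decrease
        e = np.zeros(K)
        e[i] = 1.0
        d = e - w                    # move toward that vertex
        denom = d @ G @ d
        if denom <= 1e-12:           # converged: no descent direction left
            break
        gamma = np.clip(-(w @ G @ d) / denom, 0.0, 1.0)  # exact line search
        w = w + gamma * d
    return w

class EMABalancer:
    """Balancing coefficients smoothed by an exponential moving average,
    mirroring the abstract's guard against transient single-batch noise."""
    def __init__(self, K, beta=0.9):
        self.beta = beta
        self.w = np.full(K, 1.0 / K)
    def update(self, grads):
        w_t = min_norm_weights(grads)
        self.w = self.beta * self.w + (1.0 - self.beta) * w_t
        return grads.T @ self.w      # single harmonized update direction (D,)
```

For two conflicting gradients such as g1 = [1, 0] and g2 = [-0.5, 1], the min-norm combination yields a direction with positive inner product against both, which is the property the abstract reports (the worst-aligned reward's gradient cosine turning from negative to positive).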