LISA: Likelihood Score Alignment for Visual-Condition Controllable Generation

Abstract

The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features into a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: (1) the main network preserves visual perceptual quality by providing a prior unconditional score; (2) the side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate features of the side network with an approximated likelihood score target. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space with a lightweight decoder. Then, we construct an approximated likelihood score target and compute the distance between the decoder's output and this target as an additional regularization loss. Finally, we jointly optimize the side network and the decoder with both the standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrate that LISA not only consistently accelerates training convergence and improves the quality of the final synthesized results, but also encourages the side network's features to be more disentangled for conditional modeling, with negligible additional training cost and no extra inference cost.
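
The score-based view above rests on Bayes' rule: the conditional score is the sum of the prior (unconditional) score and the likelihood score, so a likelihood score can in principle be recovered as the difference between a conditional and an unconditional score. The following minimal sketch checks this identity numerically on a toy 1-D Gaussian model. The toy model, the function names, and the choice of a difference-based target are illustrative assumptions; the abstract does not specify how LISA actually constructs its approximated target.

```python
# Toy 1-D check of the score decomposition (an illustrative assumption,
# not the setup used in the paper):
#   prior       x ~ N(0, 1)
#   likelihood  c | x ~ N(x, s^2)
#   posterior   x | c ~ N(c / (1 + s^2), s^2 / (1 + s^2))


def prior_score(x: float) -> float:
    """d/dx log p(x) for x ~ N(0, 1): the 'unconditional' score."""
    return -x


def likelihood_score(x: float, c: float, s: float) -> float:
    """d/dx log p(c | x) for c | x ~ N(x, s^2)."""
    return (c - x) / (s * s)


def posterior_score(x: float, c: float, s: float) -> float:
    """d/dx log p(x | c), using the closed-form Gaussian posterior."""
    mean = c / (1.0 + s * s)
    var = (s * s) / (1.0 + s * s)
    return -(x - mean) / var


def likelihood_score_target(x: float, c: float, s: float) -> float:
    """Hypothetical target: conditional score minus unconditional score."""
    return posterior_score(x, c, s) - prior_score(x)


if __name__ == "__main__":
    x, c, s = 0.3, 1.2, 0.5
    # The two values should agree up to floating-point error.
    print(likelihood_score(x, c, s), likelihood_score_target(x, c, s))
```

In this reading, the frozen main network plays the role of `prior_score`, while the side network is what must supply the remaining `likelihood_score` term.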