LISA: Likelihood Score Alignment for Visual-Condition Controllable Generation

Abstract

The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features into a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) the main network preserves visual perceptual quality by providing a prior unconditional score; 2) the side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate features of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space with a lightweight decoder. Then, we construct an approximated likelihood score target and compute the distance between the decoder's output and this target as an additional regularization loss. Finally, we jointly optimize the side network and the decoder with both the standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrate that LISA not only consistently accelerates training convergence and improves final synthesis results, but also encourages the side network's features to be more disentangled for conditional modeling, with negligible additional training cost and zero extra inference cost.
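
The score-based view above rests on Bayes' rule: the conditional score splits into a prior (unconditional) score plus a likelihood score, ∇_x log p(x|c) = ∇_x log p(x) + ∇_x log p(c|x). The following is a minimal sketch of that split on a 1-D Gaussian toy model, together with a mean-squared alignment loss of the kind the abstract describes. It is not the paper's implementation. The toy distributions, the variance `S2`, the loss weight `weight`, the function names, and the choice of target (conditional score minus unconditional score) are all assumptions made for illustration; the abstract does not specify how the approximated target is built.

```python
# Toy 1-D Gaussian model (an illustrative assumption, not the paper's setup):
#   prior       x ~ N(0, 1)
#   likelihood  c | x ~ N(x, S2)
S2 = 0.5  # hypothetical likelihood variance


def prior_score(x):
    """d/dx log p(x): the role the abstract assigns to the main network."""
    return -x


def likelihood_score(x, c):
    """d/dx log p(c | x): the role the abstract assigns to the side network."""
    return (c - x) / S2


def posterior_score(x, c):
    """d/dx log p(x | c), from the closed-form Gaussian posterior."""
    mean = c / (1.0 + S2)
    var = S2 / (1.0 + S2)
    return -(x - mean) / var


def alignment_loss(decoded, target):
    """Mean squared distance between decoder outputs and score targets."""
    return sum((d - t) ** 2 for d, t in zip(decoded, target)) / len(target)


def total_loss(diffusion_loss, decoded, target, weight=0.1):
    """Standard loss plus the weighted regularizer; `weight` is hypothetical."""
    return diffusion_loss + weight * alignment_loss(decoded, target)


xs = [-1.0, 0.0, 0.5, 2.0]
c = 1.5
# Assumed target construction: conditional score minus unconditional score.
target = [posterior_score(x, c) - prior_score(x) for x in xs]
# In this toy model the exact likelihood score is known in closed form.
exact = [likelihood_score(x, c) for x in xs]
```

In this toy setting the assumed target matches the exact likelihood score, so a decoder that outputs `exact` would incur zero alignment loss. In a real dual-branch model the decoded values would come from features hooked at the designated side-network layer, and the target would only be an approximation.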