Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
October 2, 2025
Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
cs.AI
Abstract
Text-to-image (T2I) models excel on single-entity prompts but struggle with
multi-subject descriptions, often showing attribute leakage, identity
entanglement, and subject omissions. We introduce the first theoretical
framework with a principled, optimizable objective for steering sampling
dynamics toward multi-subject fidelity. Viewing flow matching (FM) through
stochastic optimal control (SOC), we formulate subject disentanglement as
control over a trained FM sampler. This yields two architecture-agnostic
algorithms: (i) a training-free test-time controller that perturbs the base
velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight
fine-tuning rule that regresses a control network to a backward adjoint signal
while preserving base-model capabilities. The same formulation unifies prior
attention heuristics, extends to diffusion models via a flow-diffusion
correspondence, and provides the first fine-tuning route explicitly designed
for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and
Stable Diffusion XL, both algorithms consistently improve multi-subject
alignment while maintaining base-model style. Test-time control runs
efficiently on commodity GPUs, and fine-tuned controllers trained on limited
prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal
Control for Unentangled Subjects), which achieves state-of-the-art
multi-subject fidelity across models.
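
To make the idea of a test-time controller that perturbs the base velocity more concrete, here is a minimal toy sketch. It is not the paper's algorithm: the velocity field `base_velocity`, the quadratic cost, the gain `lam`, and the two-dimensional state are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of adding a control term to a fixed velocity field during Euler sampling, where the control is the negative gradient of a cost.

```python
import numpy as np

def base_velocity(x, t):
    # Hypothetical stand-in for a trained flow-matching velocity field.
    # Here it simply pulls the state toward the origin.
    return -x

def cost_grad(x, target):
    # Gradient of the toy cost 0.5 * ||x - target||^2 (an assumption,
    # standing in for a multi-subject fidelity objective).
    return x - target

def sample(x0, target, lam, steps=50):
    # Euler integration of dx/dt = v(x, t) + u(x, t) on t in [0, 1],
    # with the control u = -lam * grad(cost) added to the base velocity.
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = base_velocity(x, t)
        u = -lam * cost_grad(x, target)
        x = x + dt * (v + u)
    return x

x0 = np.array([2.0, -1.0])
target = np.array([1.0, 1.0])

plain = sample(x0, target, lam=0.0)    # uncontrolled base sampler
steered = sample(x0, target, lam=2.0)  # sampler with the additive control

print("distance without control:", np.linalg.norm(plain - target))
print("distance with control:   ", np.linalg.norm(steered - target))
```

In this toy setting the controlled trajectory ends closer to `target` than the uncontrolled one, while the base velocity itself is left untouched. In the setting the abstract describes, the state would be an image latent and the cost would encode subject disentanglement; those details are not given here, so the sketch leaves them out.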