Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
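
As a rough illustration of what "perturbing the base velocity" of a flow sampler can look like, the following is a minimal toy sketch and not the method described above. It assumes a hypothetical stand-in velocity field (`base_velocity`), a hypothetical quadratic cost with gradient `cost_grad`, and a hypothetical control strength `lam`; the actual objective, control term, and sampler are not specified in the abstract. The sketch integrates the flow ODE with Euler steps and adds a control term equal to the negative cost gradient at each step.

```python
import numpy as np

def base_velocity(x, t):
    # Hypothetical stand-in for a trained FM velocity field:
    # it simply pulls samples toward the origin.
    return -x

def cost_grad(x, target):
    # Gradient of the toy cost 0.5 * ||x - target||^2 (an assumption,
    # not a multi-subject fidelity objective).
    return x - target

def sample(x0, target, lam=0.0, steps=50):
    # Euler integration of dx/dt = v(x, t) + u(x, t) on t in [0, 1],
    # where u = -lam * grad(cost) perturbs the base velocity.
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        u = -lam * cost_grad(x, target)  # one gradient evaluation per step
        x = x + dt * (base_velocity(x, t) + u)
    return x

x0 = np.array([2.0, -1.0])
target = np.array([1.0, 1.0])
uncontrolled = sample(x0, target, lam=0.0)
controlled = sample(x0, target, lam=2.0)
```

With `lam=0.0` the sampler reduces to the base dynamics, so the base model's behavior is recovered exactly; increasing `lam` trades fidelity to the base flow for a lower value of the toy cost.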
