MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation

Abstract

Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
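The subtraction step in the evaluation protocol can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: each subject is summarized by one embedding vector per dimension (for example, face identity), similarity is cosine similarity, and the generated subjects have already been matched to ground-truth slots. The function names `cosine`, `similarity_matrix`, and `confusion_delta` are hypothetical and are not taken from the benchmark's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(rows, cols):
    """Entry [i][j] is the similarity of rows[i] to cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


def confusion_delta(generated, reference):
    """Slot-to-slot similarity of generated subjects minus the ground-truth baseline.

    Assumes generated[i] is already matched to reference slot i.
    Under this sketch's reading, a negative diagonal entry suggests that a
    subject lost its own attributes, and a positive off-diagonal entry
    suggests that it picked up attributes of another slot beyond what the
    reference subjects already share.
    """
    s_gen = similarity_matrix(generated, reference)
    s_gt = similarity_matrix(reference, reference)
    n = len(reference)
    return [[s_gen[i][j] - s_gt[i][j] for j in range(n)] for i in range(n)]


# Toy example: two orthogonal reference embeddings, with the generated
# subjects carrying each other's attributes (a "swap" pattern).
reference = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(confusion_delta(swapped, reference))  # [[-1.0, 1.0], [1.0, -1.0]]
```

The baseline matters when the reference subjects resemble each other. For references `[[1.0, 0.0], [1.0, 1.0]]`, the raw off-diagonal similarity is about 0.707 even for a perfect generation, whereas the subtracted matrix is all zeros, so the inherent resemblance is not counted as interference.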