

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

December 14, 2025
作者: Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
cs.AI

Abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while largely neglecting distinction: the ability to identify and generate the correct subject when the input contains multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models on composition and distinction tasks across two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
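The "attention-based masking" mentioned above can be illustrated with a minimal sketch: restrict cross-attention so that query tokens only attend to key/value tokens belonging to the target subject, suppressing distractor candidates. This is a generic illustration of the technique, not the paper's actual implementation; the function name, shapes, and mask convention are assumptions.

```python
import numpy as np

def masked_attention(q, k, v, subject_mask):
    """Illustrative subject-masked attention (not Scone's actual code).

    q: (Lq, d) query tokens; k, v: (Lk, d) key/value tokens.
    subject_mask: (Lk,) boolean, True for tokens of the target subject.
    Tokens of distractor subjects receive (near-)zero attention weight.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (Lq, Lk)
    scores = np.where(subject_mask[None, :], scores, -1e9)   # suppress distractors
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ v                                       # (Lq, d)
```

With only one key unmasked, every output row collapses to that key's value vector, which is the intended "attend only to the correct subject" behavior.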
December 18, 2025