Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
December 14, 2025
Authors: Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
cs.AI
Abstract
Subject-driven image generation has advanced from single-subject synthesis to multi-subject composition, yet it has largely neglected distinction: the ability to identify and generate the correct subject when the input contains multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation framework that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then strengthens distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments show that Scone outperforms existing open-source models on multi-subject composition and distinction tasks across two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
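The abstract names attention-based masking as the mechanism that suppresses interference from non-target candidates, but gives no implementation details. Below is a minimal, hypothetical sketch of the general idea; the function `masked_cross_attention`, the tensor shapes, and the `subject_mask` construction are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: masking cross-attention so the generator attends only
# to reference tokens of the target subject. Shapes, the mask, and the -inf
# masking rule are illustrative assumptions, not Scone's actual mechanism.
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, subject_mask):
    """q: (B, Lq, D) generation-side queries; k, v: (B, Lk, D) reference tokens.
    subject_mask: (B, Lk) bool, True where a reference token belongs to the
    target subject. Attention to non-target tokens is suppressed."""
    scale = q.size(-1) ** -0.5
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale      # (B, Lq, Lk)
    scores = scores.masked_fill(~subject_mask[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)             # (B, Lq, D)

# Toy usage: the first two reference tokens belong to the target subject.
B, Lq, Lk, D = 1, 4, 4, 8
q, k, v = torch.randn(B, Lq, D), torch.randn(B, Lk, D), torch.randn(B, Lk, D)
subject_mask = torch.tensor([[True, True, False, False]])
out = masked_cross_attention(q, k, v, subject_mask)          # (1, 4, 8)
```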