OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
December 9, 2025
Authors: Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang
cs.AI
Abstract
Despite promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We use vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize input images for subject-driven generation and box-guided inpainting to synthesize input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We use a VLM to validate the synthesized samples, re-synthesize failed cases with the stage (iii) pipeline, and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
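Stage (iii) names an "irregular boundary erosion" augmentation applied to the synthesized reference images, but the abstract does not say how it is implemented. Below is a minimal Python/OpenCV sketch of one plausible realization, in which a binary subject mask is eroded by a spatially varying, noise-modulated depth so the resulting outline no longer traces the original segmentation boundary. The function name and parameters (irregular_boundary_erosion, max_depth, noise_scale) are our assumptions, not the authors' code.

```python
import cv2
import numpy as np

def irregular_boundary_erosion(mask, max_depth=15, noise_scale=0.5, seed=0):
    """Erode a binary subject mask (uint8, 0/255) by a spatially varying
    depth. A hypothetical sketch of the augmentation, not the paper's code."""
    rng = np.random.default_rng(seed)
    fg = (mask > 0).astype(np.uint8)
    # Distance from each foreground pixel to the nearest background pixel.
    dist = cv2.distanceTransform(fg, cv2.DIST_L2, 5)
    # Smooth random field in [0, 1] that modulates the local erosion depth.
    noise = cv2.GaussianBlur(rng.random(mask.shape).astype(np.float32),
                             (0, 0), sigmaX=max_depth)
    noise = (noise - noise.min()) / (noise.max() - noise.min() + 1e-8)
    depth = max_depth * ((1.0 - noise_scale) + noise_scale * noise)
    # Keep only pixels that lie deeper inside the subject than the locally
    # required depth; the new boundary is irregular rather than a clean offset.
    return np.where(dist > depth, 255, 0).astype(np.uint8)

# Example: erode a synthetic circular mask.
mask = np.zeros((256, 256), np.uint8)
cv2.circle(mask, (128, 128), 80, 255, -1)
eroded = irregular_boundary_erosion(mask, max_depth=20)
```

If this reading is right, the point of the irregular cut is to keep the training pairs from exposing a clean copy-paste silhouette that the model could key on instead of the subject's identity.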