OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
December 9, 2025
Authors: Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang
cs.AI
Abstract
Despite promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We use vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation-map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to synthesize the input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We use a VLM to validate the synthesized samples, re-run the stage (iii) synthesis for failed samples, and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
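
To illustrate the diversity-aware pairing idea mentioned in stage (ii), the following Python sketch selects frame pairs of one tracked subject that still agree on identity but differ in appearance or pose. It is a minimal sketch under assumed inputs: the function name, the source of the per-frame subject embeddings, and the similarity thresholds are illustrative choices, not the paper's implementation or values.

    import numpy as np

    def diversity_aware_pairs(subject_embeds, id_sim_min=0.5, div_sim_max=0.85, top_k=3):
        # subject_embeds: (N, D) array of per-frame embeddings for one tracked subject.
        # Keep pairs whose cosine similarity is high enough to be the same identity
        # (>= id_sim_min) yet low enough to differ in pose/appearance (<= div_sim_max).
        # Thresholds are illustrative assumptions, not values from the paper.
        x = subject_embeds / np.linalg.norm(subject_embeds, axis=1, keepdims=True)
        sim = x @ x.T
        pairs = []
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                if id_sim_min <= sim[i, j] <= div_sim_max:
                    pairs.append((i, j, float(sim[i, j])))
        pairs.sort(key=lambda p: p[2])   # most diverse (lowest similarity) first
        return pairs[:top_k]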
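The irregular boundary erosion in stage (iii) can likewise be sketched as a morphological erosion of the subject mask with a randomized structuring element, so that pasted or inpainted regions do not leave a clean, learnable edge. The sketch below assumes a binary {0, 1} uint8 mask and uses OpenCV; the function and its parameters are hypothetical, not the released pipeline.

    import numpy as np
    import cv2

    def irregular_boundary_erosion(mask, max_kernel=15, keep_prob=0.5, seed=None):
        # mask: binary {0, 1} uint8 subject mask (assumed input format).
        rng = np.random.default_rng(seed)
        k = int(rng.integers(3, max_kernel + 1)) | 1              # random odd kernel size
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
        eroded = cv2.erode(mask, kernel, iterations=1)
        # Randomly retain part of the eroded rim so the resulting edge is irregular
        # rather than a uniform shrink of the original boundary.
        boundary = cv2.subtract(mask, eroded)
        keep = (rng.random(mask.shape) < keep_prob).astype(np.uint8)
        return cv2.bitwise_or(eroded, cv2.bitwise_and(boundary, keep))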