

MultiRef: Controllable Image Generation with Multiple Visual References

August 9, 2025
Authors: Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
cs.AI

Abstract

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generation frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine, RefBlend, covering 10 reference types and 33 reference combinations. Building on RefBlend, we further construct MultiRef, a dataset of 38k high-quality images, to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning: the best model, OmniGen, achieves on average only 66.6% on synthetic samples and 79.0% on real-world cases relative to the golden answers. These findings provide valuable directions for developing more flexible, human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
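
The abstract implies a simple evaluation loop: condition a model on several reference images plus a prompt, then score the output against a golden answer. Below is a minimal sketch of that loop in Python; the `model.generate` interface and the `naive_similarity` metric are hypothetical placeholders (the benchmark's actual models and scoring protocol are described in the paper), shown only to make the multi-reference setup concrete.

```python
# Hypothetical sketch of a multi-reference evaluation loop.
# `model.generate` stands in for an interleaved image-text model
# (e.g., OmniGen) conditioned on several references at once; the
# real benchmark uses stronger metrics than raw pixel cosine.
from pathlib import Path

import numpy as np
from PIL import Image


def naive_similarity(a: Image.Image, b: Image.Image, size=(256, 256)) -> float:
    """Toy stand-in for a perceptual metric: cosine similarity of pixels."""
    x = np.asarray(a.convert("RGB").resize(size), dtype=np.float32).ravel()
    y = np.asarray(b.convert("RGB").resize(size), dtype=np.float32).ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))


def evaluate_sample(model, prompt: str, refs: list[Path], golden: Path) -> float:
    # Load all reference images and condition the generator on them
    # jointly -- the multi-reference setting the benchmark probes.
    references = [Image.open(p) for p in refs]
    output = model.generate(prompt=prompt, images=references)
    return naive_similarity(output, Image.open(golden))
```

Averaging `evaluate_sample` over the 990 synthetic and 1,000 real-world samples would yield per-model scores comparable in spirit to the 66.6% and 79.0% figures reported above.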