MultiRef: Controllable Image Generation with Multiple Visual References
August 9, 2025
Authors: Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
cs.AI
Abstract
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generation frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, which covers 10 reference types and 33 reference combinations. Based on RefBlend, we further construct MultiRef, a dataset of 38k high-quality images, to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning: the best model, OmniGen, achieves on average only 66.6% on synthetic samples and 79.0% on real-world cases relative to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.