MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
November 28, 2025
Authors: Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
cs.AI
Abstract
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring progress in model performance, or pinpointing model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis of a variety of text-to-image models reveals their strengths, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana.