MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Abstract

Recent text-to-image generation models have acquired the ability to perform multi-reference generation and editing: inheriting the appearance of subjects from multiple reference images and re-rendering them in new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which makes it difficult to measure how model performance advances, or to pinpoint model weaknesses, under different multi-reference conditions. In addition, their task definitions remain vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, a benchmark carefully designed to assess the limits of model capabilities by covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals their strengths, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana.
