
GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

October 13, 2025
作者: Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
cs.AI

Abstract

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation, which requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive evaluations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.
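The abstract describes per-subset, task-specific evaluation pipelines in place of a single MLLM-as-a-Judge. The details of those pipelines are not given here, but the overall dispatch can be sketched as follows; a minimal illustration, where the `Sample` fields and the three scorer functions (`score_ugc`, `score_t2i`, `score_edit`) are hypothetical placeholders, not the benchmark's actual metrics:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Sample:
    subset: str        # "UGC", "T2I", or "Edit" (the three GIR-Bench subsets)
    prediction: dict   # parsed model output (hypothetical schema)
    reference: dict    # ground-truth annotation (hypothetical schema)

def score_ugc(pred: dict, ref: dict) -> float:
    # Consistency check: understanding answer and generated entity
    # must both match the same underlying knowledge.
    return float(pred.get("answer") == ref["answer"]
                 and pred.get("entity") == ref["entity"])

def score_t2i(pred: dict, ref: dict) -> float:
    # Fraction of logical constraints the generated image satisfies.
    hits = [c in pred.get("satisfied", []) for c in ref["constraints"]]
    return sum(hits) / len(hits)

def score_edit(pred: dict, ref: dict) -> float:
    # Multi-step editing: all required edit steps applied, in order.
    return float(pred.get("steps") == ref["steps"])

# One rule-based scorer per subset, instead of one generic judge model.
SCORERS: Dict[str, Callable[[dict, dict], float]] = {
    "UGC": score_ugc, "T2I": score_t2i, "Edit": score_edit,
}

def evaluate(samples: List[Sample]) -> Dict[str, float]:
    """Average the subset-specific score over each subset's samples."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for s in samples:
        totals[s.subset] = totals.get(s.subset, 0.0) + SCORERS[s.subset](s.prediction, s.reference)
        counts[s.subset] = counts.get(s.subset, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}
```

Keeping scorers deterministic and per-task is what makes the scores interpretable: a low T2I number points directly at unsatisfied constraints rather than at an opaque judge rating.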