GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
October 13, 2025
Authors: Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
cs.AI
Abstract
Unified multimodal models integrate the reasoning capacity of large language
models with both image understanding and generation, showing great promise for
advanced multimodal intelligence. However, the community still lacks a rigorous
reasoning-centric benchmark to systematically evaluate the alignment between
understanding and generation, and their generalization potential in complex
visual tasks. To this end, we introduce GIR-Bench, a comprehensive
benchmark that evaluates unified models across three complementary
perspectives. Firstly, we investigate understanding-generation consistency
(GIR-Bench-UGC), asking whether models can consistently leverage the same
knowledge in both understanding and generation tasks. Secondly, we investigate
whether models can perform reasoning-centric text-to-image generation that
requires applying logical constraints and implicit knowledge to generate
faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models
can handle multi-step reasoning in editing tasks (GIR-Bench-Edit). For each
subset, we carefully design a task-specific evaluation pipeline, enabling
fine-grained and interpretable evaluation while mitigating biases from the
prevalent MLLM-as-a-Judge paradigm. Extensive
ablations across various unified models and generation-only systems show that,
although unified models are more capable on reasoning-driven visual tasks, they
still exhibit a persistent gap between understanding and generation. The data
and code for GIR-Bench are available at
https://hkust-longgroup.github.io/GIR-Bench.