GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Abstract

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and image generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous, reasoning-centric benchmark for systematically evaluating the alignment between understanding and generation, as well as the generalization potential of these models on complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation, which requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable at reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.
