MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

July 5, 2024
作者: Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
cs.AI

Abstract

While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges, including smaller-sized CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and closed-source VLMs (e.g., GPT-4o, Claude 3), on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average. Compared with open-source VLMs, smaller-sized scoring models provide better feedback on text-image alignment and image quality, while VLMs provide more accurate feedback on safety and generation bias due to their stronger reasoning capabilities. Further studies on feedback scale reveal that VLM judges generally provide more accurate and stable feedback on natural-language (Likert) scales than on numerical scales. Notably, human evaluations of models fine-tuned end-to-end using separate feedback from each of these multimodal judges reach similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
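
As a concrete illustration of the pairwise evaluation described above, the sketch below shows how a CLIP-based scoring model can act as a judge on a single preference pair: the judge is counted as correct if it assigns the human-chosen image a higher score than the rejected one. This is a minimal sketch, not the paper's evaluation code; the checkpoint, prompt, and file paths are illustrative assumptions.

```python
# A minimal sketch of judging one preference pair with a CLIP-based scoring model.
# The checkpoint, prompt, and image paths below are illustrative assumptions and
# are not taken from the MJ-Bench release.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a red bicycle leaning against a brick wall"  # hypothetical prompt
chosen = Image.open("chosen.png")                      # human-preferred image (placeholder path)
rejected = Image.open("rejected.png")                  # less-preferred image (placeholder path)

inputs = processor(text=[prompt], images=[chosen, rejected],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image has shape (num_images, num_texts); squeeze to one score per image
    scores = model(**inputs).logits_per_image.squeeze(-1)

chosen_score, rejected_score = scores.tolist()
# The judge is counted as correct on this pair if it prefers the chosen image.
print(f"chosen={chosen_score:.2f}  rejected={rejected_score:.2f}  "
      f"correct={chosen_score > rejected_score}")
```

Averaging this per-pair correctness over a subcategory of preference pairs gives an accuracy-style measure of a judge's feedback quality for that perspective.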
