MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
July 5, 2024
Authors: Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
cs.AI
Abstract
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly
proliferating, they often encounter challenges such as hallucination, bias, and
the production of unsafe, low-quality output. To effectively address these
issues, it is crucial to align these models with desired behaviors based on
feedback from a multimodal judge. Despite their significance, current
multimodal judges frequently undergo inadequate evaluation of their
capabilities and limitations, potentially leading to misalignment and unsafe
fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel
benchmark which incorporates a comprehensive preference dataset to evaluate
multimodal judges in providing feedback for image generation models across four
key perspectives: alignment, safety, image quality, and bias. Specifically, we
evaluate a large variety of multimodal judges including smaller-sized
CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and
closed-source VLMs (e.g., GPT-4o, Claude 3) on each decomposed subcategory of our
preference dataset. Experiments reveal that closed-source VLMs generally provide
better feedback, with GPT-4o outperforming other judges on average. Compared
with open-source VLMs, smaller-sized scoring models can provide better feedback
regarding text-image alignment and image quality, while VLMs provide more
accurate feedback regarding safety and generation bias due to their stronger
reasoning capabilities. Further studies of feedback scales reveal that VLM
judges can generally provide more accurate and stable feedback in natural
language (Likert scale) than on numerical scales. Notably, human evaluations on
end-to-end fine-tuned models using separate feedback from these multimodal
judges provide similar conclusions, further confirming the effectiveness of
MJ-Bench. All data, code, and models are available at
https://huggingface.co/MJ-Bench.
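
The core evaluation protocol described above can be sketched in a few lines. This is an illustrative sketch, not the authors' released code: it assumes each benchmark example pairs a prompt with a "chosen" and a "rejected" image, and counts a judge as correct when it scores the chosen image higher. The `toy_judge` below is a hypothetical stand-in; real judges would be CLIP-based scorers or VLMs.

```python
# Sketch of preference-based judge evaluation (assumed protocol, not the
# official MJ-Bench implementation).

from typing import Callable, List, Tuple

def judge_accuracy(
    judge: Callable[[str, str], float],   # (prompt, image_id) -> score
    pairs: List[Tuple[str, str, str]],    # (prompt, chosen, rejected)
) -> float:
    """Fraction of pairs where the judge prefers the chosen image."""
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if judge(prompt, chosen) > judge(prompt, rejected)
    )
    return correct / len(pairs)

# Hypothetical toy judge: scores by word overlap between the prompt and a
# fake "caption" embedded in the image identifier. Purely for illustration.
def toy_judge(prompt: str, image_id: str) -> float:
    caption = image_id.rsplit("/", 1)[-1].replace("_", " ")
    return float(len(set(prompt.split()) & set(caption.split())))

pairs = [
    ("a red cat", "imgs/red_cat", "imgs/blue_dog"),
    ("a blue dog", "imgs/blue_dog", "imgs/red_cat"),
]
print(judge_accuracy(toy_judge, pairs))  # -> 1.0 on this toy set
```

In the actual benchmark, this accuracy would be computed separately for each decomposed subcategory (alignment, safety, quality, bias), which is what allows the per-perspective comparisons reported in the abstract.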