MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
July 5, 2024
Authors: Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
cs.AI
Abstract
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly
proliferating, they often encounter challenges such as hallucination, bias, and
the production of unsafe, low-quality output. To effectively address these
issues, it is crucial to align these models with desired behaviors based on
feedback from a multimodal judge. Despite their significance, current
multimodal judges frequently undergo inadequate evaluation of their
capabilities and limitations, potentially leading to misalignment and unsafe
fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel
benchmark which incorporates a comprehensive preference dataset to evaluate
multimodal judges in providing feedback for image generation models across four
key perspectives: alignment, safety, image quality, and bias. Specifically, we
evaluate a large variety of multimodal judges including smaller-sized
CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and
closed-source VLMs (e.g., GPT-4o, Claude 3) on each decomposed subcategory of our
preference dataset. Experiments reveal that closed-source VLMs generally provide
better feedback, with GPT-4o outperforming other judges on average. Compared
with open-source VLMs, smaller-sized scoring models can provide better feedback
regarding text-image alignment and image quality, while VLMs provide more
accurate feedback regarding safety and generation bias due to their stronger
reasoning capabilities. Further studies of feedback scales reveal that VLM
judges can generally provide more accurate and stable feedback in natural
language (Likert scale) than on numerical scales. Notably, human evaluations on
end-to-end fine-tuned models using separate feedback from these multimodal
judges provide similar conclusions, further confirming the effectiveness of
MJ-Bench. All data, code, and models are available at
https://huggingface.co/MJ-Bench.
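
The core evaluation protocol described above can be sketched in a few lines. This is an illustrative sketch, not the authors' released code: it assumes each benchmark example pairs a prompt with a "chosen" and a "rejected" image, and counts a judge as correct when it scores the chosen image higher. The `toy_judge` below is a hypothetical stand-in; real judges would be CLIP-based scorers or VLMs.

```python
# Sketch of preference-based judge evaluation (assumed protocol, not the
# official MJ-Bench implementation).

from typing import Callable, List, Tuple

def judge_accuracy(
    judge: Callable[[str, str], float],   # (prompt, image_id) -> score
    pairs: List[Tuple[str, str, str]],    # (prompt, chosen, rejected)
) -> float:
    """Fraction of pairs where the judge prefers the chosen image."""
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if judge(prompt, chosen) > judge(prompt, rejected)
    )
    return correct / len(pairs)

# Hypothetical toy judge: scores by word overlap between the prompt and a
# fake "caption" embedded in the image identifier. Purely for illustration.
def toy_judge(prompt: str, image_id: str) -> float:
    caption = image_id.rsplit("/", 1)[-1].replace("_", " ")
    return float(len(set(prompt.split()) & set(caption.split())))

pairs = [
    ("a red cat", "imgs/red_cat", "imgs/blue_dog"),
    ("a blue dog", "imgs/blue_dog", "imgs/red_cat"),
]
print(judge_accuracy(toy_judge, pairs))  # -> 1.0 on this toy set
```

In the actual benchmark, this accuracy would be computed separately for each decomposed subcategory (alignment, safety, quality, bias), which is what allows the per-perspective comparisons reported in the abstract.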