MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Abstract

Text-to-image models such as DALLE-3 and Stable Diffusion are proliferating rapidly, yet they often suffer from hallucination, bias, and unsafe or low-quality outputs. Addressing these issues effectively requires aligning the models with desired behaviors based on feedback from a multimodal judge. Despite their importance, current multimodal judges are often inadequately evaluated for their capabilities and limitations, which can lead to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark that incorporates a comprehensive preference dataset for evaluating multimodal judges that provide feedback to image generation models, across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a wide variety of multimodal judges, including smaller-sized CLIP-based scoring models, open-source VLMs (e.g., the LLaVA family), and closed-source VLMs (e.g., GPT-4o, Claude 3), on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average. Compared with open-source VLMs, smaller-sized scoring models provide better feedback on text-image alignment and image quality, while VLMs provide more accurate feedback on safety and generation bias thanks to their stronger reasoning capabilities. Further studies on feedback scale reveal that VLM judges generally provide more accurate and stable feedback on a natural-language (Likert) scale than on numerical scales. Notably, human evaluations of models fine-tuned end to end using feedback from each of these multimodal judges separately lead to similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
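
To make the evaluation setup concrete, the following is a minimal sketch of how a judge could be scored against a preference dataset: for each prompt, the judge scores a preferred and a dispreferred image, and it is counted as correct when the preferred image receives the strictly higher score. This is an illustration under stated assumptions, not the benchmark's actual code. The names PreferencePair, pairwise_accuracy, LIKERT_RANK, and the five Likert labels are hypothetical, and the abstract does not specify the exact metric or how ties are handled.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    """One preference example (hypothetical schema, not MJ-Bench's own)."""
    prompt: str
    chosen: str    # identifier of the preferred image
    rejected: str  # identifier of the dispreferred image


# A judge maps (prompt, image identifier) to a numeric score.
Judge = Callable[[str, str], float]

# Assumed five-point Likert labels, mapped to ordinal ranks so that
# natural-language verdicts can be compared like numeric scores.
LIKERT_RANK: Dict[str, int] = {
    "Extremely Poor": 0,
    "Poor": 1,
    "Average": 2,
    "Good": 3,
    "Outstanding": 4,
}


def pairwise_accuracy(judge: Judge, pairs: Iterable[PreferencePair]) -> float:
    """Fraction of pairs where the judge scores `chosen` strictly above `rejected`.

    Ties count as incorrect here; that is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1 for p in pairs
        if judge(p.prompt, p.chosen) > judge(p.prompt, p.rejected)
    )
    return correct / len(pairs)


# Toy data: a lookup table stands in for a real scoring model or VLM.
toy_scores = {"img_a": 0.9, "img_b": 0.2, "img_c": 0.4, "img_d": 0.4}
toy_pairs = [
    PreferencePair("a cat on a sofa", chosen="img_a", rejected="img_b"),
    PreferencePair("a dog in the rain", chosen="img_c", rejected="img_d"),
]


def toy_judge(prompt: str, image: str) -> float:
    return toy_scores[image]


toy_accuracy = pairwise_accuracy(toy_judge, toy_pairs)
```

In this toy run the first pair is ranked correctly and the second is a tie, so the accuracy is 0.5. A judge that answers with Likert labels can be plugged in by passing its label through LIKERT_RANK before comparison.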

