MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Abstract

Automatic evaluation of multimodal generation is difficult because automated metrics often fail to align reliably with human judgment, especially on complex tasks that span multiple modalities. To address this, we present MMMG, a comprehensive, human-aligned benchmark for multimodal generation across four modality combinations (image, audio, interleaved text and image, and interleaved text and audio). MMMG focuses on tasks that are challenging for generation models yet can still be evaluated automatically and reliably through a combination of models and programs. It comprises 49 tasks (29 of them newly developed), each with a carefully designed evaluation pipeline, and 937 instructions that systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation shows that MMMG aligns closely with human evaluation, achieving an average agreement of 94.3%. Benchmarking 24 multimodal generation models reveals that although the state-of-the-art model, GPT Image, achieves 78.3% accuracy on image generation, it falls short on multimodal reasoning and interleaved generation. The results also suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.