M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
February 21, 2025
作者: Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu
cs.AI
Abstract
The rapid advancement of AI-generated image (AGI) models has introduced
significant challenges in evaluating their quality, which requires considering
multiple dimensions such as perceptual quality, prompt correspondence, and
authenticity. To address these challenges, we propose M3-AGIQA, a comprehensive
framework for AGI quality assessment that is Multimodal, Multi-Round, and
Multi-Aspect. Our approach leverages the capabilities of Multimodal Large
Language Models (MLLMs) as joint text and image encoders and distills advanced
captioning capabilities from online MLLMs into a local model via Low-Rank
Adaptation (LoRA) fine-tuning. The framework includes a structured multi-round
evaluation mechanism, where intermediate image descriptions are generated to
provide deeper insights into the quality, correspondence, and authenticity
aspects. To align predictions with human perceptual judgments, a predictor
constructed by an xLSTM and a regression head is incorporated to process
sequential logits and predict Mean Opinion Scores (MOSs). Extensive experiments
conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves
state-of-the-art performance, effectively capturing nuanced aspects of AGI
quality. Furthermore, cross-dataset validation confirms its strong
generalizability. The code is available at
https://github.com/strawhatboy/M3-AGIQA.
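To make the LoRA distillation step above concrete, the sketch below shows the core low-rank update that LoRA fine-tuning applies to a frozen weight matrix: only the small factors A and B are trained, and B's zero initialization means the adapted layer starts out identical to the pretrained one. This is a minimal NumPy illustration of the general LoRA mechanism, not the paper's implementation; all sizes, the scaling factor, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 6, 8, 2        # illustrative layer dimensions and LoRA rank
alpha = 4.0                     # illustrative LoRA scaling factor

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.1, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized

# Effective weight during fine-tuning: only A and B receive gradient updates,
# so the number of trainable parameters is r*(d_in + d_out) instead of d_out*d_in.
W_eff = W + (alpha / r) * (B @ A)

x = rng.normal(size=d_in)
# Because B starts at zero, the adapted layer initially matches the frozen layer.
assert np.allclose(W_eff @ x, W @ x)
```

After training, the update `(alpha / r) * (B @ A)` can be merged into `W` so inference incurs no extra cost, which is the usual motivation for this parametrization.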