Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

March 8, 2025
Authors: Mingxing Li, Rui Wang, Lei Sun, Yancheng Bai, Xiangxiang Chu
cs.AI

Abstract

The rapid expansion of the mobile internet has led to a substantial increase in user-generated content (UGC) images, making thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) a single score is inadequate to capture the hierarchical nature of human perception; 2) how to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), comprising 14,715 UGC images, each annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity), and high level (e.g., composition). In addition, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next-token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) combined with the learned fine-grained attributes, the proposed method outperforms SOTA methods on five public datasets for IQA and IAA with superior interpretability, and shows strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.
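
The idea of emitting a numerical score through ordinary next-token prediction can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `score_to_target` and `target_to_score` are hypothetical, and it assumes that "two extra significant digits" means writing the score with two digits after the decimal point, which the abstract does not spell out.

```python
def score_to_target(mos: float, extra_digits: int = 2) -> str:
    """Render a score as the text a model could be trained to generate.

    Assumption: "extra digits" are digits after the decimal point.
    """
    return f"{mos:.{extra_digits}f}"


def target_to_score(text: str) -> float:
    """Parse generated text back into a number.

    Raises ValueError if the text is not a valid number.
    """
    return float(text.strip())


if __name__ == "__main__":
    target = score_to_target(3.7461)
    print(target)                   # 3.75
    print(target_to_score(target))  # 3.75
```

Under this assumption, rounding to two decimal places changes the score by at most 0.005, so the text target loses little precision relative to the original MOS.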
