UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
December 25, 2025
Authors: Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
cs.AI
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their understanding of perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. On this foundation, we develop UniPercept, a strong baseline trained via Domain-Adaptive Pre-Training and Task-Aligned Reinforcement Learning (RL), which generalizes robustly across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, by introducing a comprehensive benchmark together with a strong baseline, lays a solid foundation for advancing perceptual-level multimodal image understanding.
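The abstract does not describe UniPercept's inference interface, but the plug-and-play reward-model use case can be illustrated with a minimal best-of-N reranking sketch. All names below (generate_images, perceptual_score, my_diffusion_sampler, unipercept_rating) are hypothetical placeholders, assuming only that the model can return a scalar perceptual rating for an image given a prompt.

```python
# Minimal sketch (assumptions, not the released UniPercept API): use a
# perceptual-level scorer as a plug-and-play reward model to pick the best of
# N text-to-image samples. `generate_images` and `perceptual_score` are
# hypothetical stand-ins for a diffusion sampler and a UniPercept-style rating.
from typing import Any, Callable, List, Tuple

def best_of_n(
    prompt: str,
    n: int,
    generate_images: Callable[[str, int], List[Any]],  # prompt, n -> images
    perceptual_score: Callable[[Any, str], float],     # image, prompt -> scalar reward
) -> Tuple[Any, float]:
    """Generate n candidates for `prompt` and return the highest-reward one."""
    candidates = generate_images(prompt, n)
    scored = [(image, perceptual_score(image, prompt)) for image in candidates]
    return max(scored, key=lambda pair: pair[1])

# Example wiring (names hypothetical):
# best_image, reward = best_of_n(
#     "a foggy harbor at dawn", n=8,
#     generate_images=my_diffusion_sampler,
#     perceptual_score=unipercept_rating,  # e.g., an aesthetics or quality VR score
# )
```

In this sketch the reward is a single scalar; a UniPercept-style model could equally return per-domain ratings (aesthetics, quality, structure and texture) that are combined downstream.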