
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

December 25, 2025
Authors: Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
cs.AI

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning, but their grasp of perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure & Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. On this foundation, we develop a strong baseline, UniPercept, trained via Domain-Adaptive Pre-Training and Task-Aligned Reinforcement Learning (RL), enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, by introducing a comprehensive benchmark together with a strong baseline, lays a solid foundation for advancing perceptual-level multimodal image understanding.
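
To illustrate the plug-and-play reward-model use mentioned above, here is a minimal best-of-N selection sketch in Python. The scorer interface (rate) and the aggregation of perceptual ratings into a single scalar are assumptions made for illustration; the abstract does not specify the released model's API, and any text-to-image sampler can stand in for generate.

    # Hedged sketch: using a perceptual-level scorer such as UniPercept as a
    # plug-and-play reward model for best-of-N text-to-image selection.
    # `generate` and `rate` are hypothetical callables, not an official API.
    from typing import Callable, Tuple
    from PIL import Image

    def best_of_n(
        prompt: str,
        generate: Callable[[str], Image.Image],    # any T2I sampler
        rate: Callable[[Image.Image, str], float],  # assumed scalar reward, e.g. a mean
                                                    # of aesthetics/quality/structure-texture ratings
        n: int = 4,
    ) -> Tuple[Image.Image, float]:
        """Sample n candidate images and keep the one the reward model scores highest."""
        best_img, best_score = None, float("-inf")
        for _ in range(n):
            img = generate(prompt)          # draw one candidate
            score = rate(img, prompt)       # perceptual reward for this candidate
            if score > best_score:
                best_img, best_score = img, score
        return best_img, best_score

The same rate callable could equally drive reward-weighted fine-tuning of the generator; best-of-N is shown only because it requires no training loop.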