

From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

November 27, 2025
作者: Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident for subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transfers to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
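A common way to quantify the kind of alignment the abstract describes is rank correlation between model-predicted scores and human ratings for a cognitive property (e.g., memorability). The sketch below is purely illustrative — the paper does not specify CogIP-Bench's actual metric — and implements Spearman's rank correlation in plain Python, assuming untied scores:

```python
def spearman(model_scores, human_scores):
    """Spearman rank correlation between two score lists (no ties assumed).

    Hypothetical illustration of how alignment with human judgments
    might be scored on a benchmark like CogIP-Bench.
    """
    def ranks(vals):
        # Rank 1 = smallest value; assumes all values are distinct.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r

    rx, ry = ranks(model_scores), ranks(human_scores)
    n = len(model_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Example: model ratings vs. human memorability ratings for 5 images
print(spearman([5, 1, 4, 2, 3], [4, 1, 3, 2, 5]))  # → 0.7
```

A higher rho means the model orders images more like human raters do; a post-training phase such as the one described would aim to push this value toward 1.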