The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Abstract

While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. With this observation, photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: the former focuses on identifying the factual element in an image (sky), whereas the latter transcends such object identification and views it instead as an aesthetic component, a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques and photo pre/post-processing knowledge, to provide detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism that understands image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On both existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
