The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Abstract

While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. With this observation, photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: the former focuses on identifying the factual element in an image (the sky), whereas the latter transcends such object identification and views it instead as an aesthetic component, a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism that understands image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On both existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
