The Photographer's Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Abstract

While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component, a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques and photo pre/post-processing knowledge, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.