The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Abstract

While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. With this observation, the photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: the former focuses on identifying the factual element in an image (the sky), whereas the latter transcends such object identification and views the element as an aesthetic component, a pure block of color (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), where a detailed analysis and description requires extensive expertise, including photographic techniques and knowledge of photo pre- and post-processing. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism that understands image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and on PhotoBench, our model demonstrates clear advantages over existing models.
