ShutterMuse: Real-Time Photography Guidance with MLLMs

Abstract

Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language models (MLLMs) underexplored. To address this gap, we introduce CaptureGuide-Bench, a benchmark with two complementary tasks: photographer-side composition decision and refinement, and subject-side scene-conditioned pose recommendation. Our evaluation reveals limitations in both kinds of model: general-purpose MLLMs can make composition decisions but cannot precisely localize refinements, while specialized aesthetic cropping models localize crops effectively but are limited to refinement. Neither provides actionable pose guidance. To support model development, we further construct CaptureGuide-Dataset, comprising 130K samples with textual rationales and structured visual annotations, and develop ShutterMuse, a unified MLLM trained with supervised and reinforcement fine-tuning. Experiments on CaptureGuide-Bench show that ShutterMuse achieves the best overall photographer-side performance among the evaluated baselines and delivers competitive subject-side pose recommendation at substantially lower inference cost, demonstrating the potential of MLLMs as interactive assistants during image capture.