ShutterMuse: Capture-Time Photography Guidance with MLLMs

Abstract

Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language models (MLLMs) underexplored. To address this gap, we introduce CaptureGuide-Bench, a benchmark with two complementary tasks: photographer-side composition decision and refinement, and subject-side scene-conditioned pose recommendation. Our evaluation shows that general-purpose MLLMs can make composition decisions but cannot precisely localize refinements, while specialized aesthetic cropping models localize crops effectively but are limited to refinement; neither provides actionable pose guidance. To support model development, we further construct CaptureGuide-Dataset, comprising 130K samples with textual rationales and structured visual annotations, and develop ShutterMuse, a unified MLLM trained with supervised and reinforcement fine-tuning. Experiments on CaptureGuide-Bench show that ShutterMuse achieves the best overall photographer-side performance among the evaluated baselines and competitive subject-side pose recommendation at substantially lower inference cost, demonstrating the potential of MLLMs as interactive assistants during image capture.