PhotoFlow: Agentic 3D Virtual Photography Missions

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. In held-out experiments under a six-round rendering budget, PhotoFlow achieves the highest external quality-alignment composite score and the highest success rate when compared with one-shot prediction, single-chain reflection, anchor-bank selection, and random search. To our knowledge, this is the first work to formulate language-conditioned virtual photography in arbitrary Blender scenes as an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
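To make the closed-loop structure concrete, the following is a minimal, illustrative sketch of a propose-review-reflect search under a six-round budget. It is not the PhotoFlow implementation: every name here (`Camera`, `Memory`, `director`, `reviewer`, `reflector`, `toy_score`, `search`) is hypothetical, the toy score is an assumed stand-in for rendering plus rule checks and visual critique, and the dead-zone radius and the spread schedule are arbitrary choices made only for the example.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Camera:
    """Hypothetical 2D camera pose; a real camera would carry full 3D parameters."""
    x: float
    y: float
    yaw: float  # radians


@dataclass
class Memory:
    """Assumed form of region memory: circles (x, y, radius) to avoid."""
    dead_zones: list = field(default_factory=list)

    def is_dead(self, cam: Camera) -> bool:
        return any(math.hypot(cam.x - zx, cam.y - zy) < r
                   for zx, zy, r in self.dead_zones)


def toy_score(cam: Camera, target=(2.0, 1.0)) -> float:
    """Stand-in for render + critique: higher is better, 0.0 at the target."""
    return -math.hypot(cam.x - target[0], cam.y - target[1])


def director(rng, center: Camera, spread: float, memory: Memory, n: int = 4):
    """Propose up to n candidates around `center`, skipping dead zones."""
    candidates, attempts = [], 0
    while len(candidates) < n and attempts < 100:
        attempts += 1
        cam = Camera(rng.gauss(center.x, spread),
                     rng.gauss(center.y, spread),
                     rng.uniform(-math.pi, math.pi))
        if not memory.is_dead(cam):
            candidates.append(cam)
    return candidates


def reviewer(incumbent, candidates, score_fn):
    """Pairwise selection: a challenger wins only if strictly better."""
    best = incumbent
    for cam in candidates:
        if best is None or score_fn(cam) > score_fn(best):
            best = cam
    return best


def reflector(memory: Memory, failed, radius: float = 0.3) -> None:
    """Turn rejected candidates into dead zones (radius is an assumption)."""
    for cam in failed:
        memory.dead_zones.append((cam.x, cam.y, radius))


def search(rounds: int = 6, seed: int = 0, score_fn=toy_score):
    rng = random.Random(seed)
    memory = Memory()
    incumbent = None
    center, spread = Camera(0.0, 0.0, 0.0), 2.0
    history = []
    for _ in range(rounds):
        candidates = director(rng, center, spread, memory)
        new_best = reviewer(incumbent, candidates, score_fn)
        improved = new_best is not incumbent
        reflector(memory, [c for c in candidates if c is not new_best])
        incumbent = new_best
        if improved:
            # Exploit: search more tightly around the new incumbent.
            center, spread = incumbent, max(0.25, spread * 0.5)
        else:
            # Explore: widen the proposal region after a failed round.
            spread = min(4.0, spread * 2.0)
        history.append(score_fn(incumbent))
    return incumbent, history


if __name__ == "__main__":
    best, history = search()
    print(best, history)
```

In this sketch the incumbent is replaced only when a challenger scores strictly higher, so the recorded score never decreases across rounds. Widening the spread after a round with no improvement is one simple way to approximate relocation toward unexplored regions; the actual mechanism may differ.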