PhotoFlow: An Autonomous 3D Virtual Photography Task

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-exploration relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. In held-out tests under a six-round rendering budget, PhotoFlow achieves the strongest external quality-alignment composite and success rate when compared against one-shot prediction, single-chain reflection, anchor-bank selection, and random search. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
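The closed-loop search described above can be illustrated with a small toy sketch. This is not the authors' implementation: every name here (`Camera`, `Memory`, `toy_score`, `passes_rules`, `director`, `reviewer`, `reflector`, `search`) is hypothetical, the scene is reduced to a 3D camera position, and a simple distance-based score stands in for rendering plus visual critique. The widening of the proposal spread after a round with no improvement is likewise only an assumed stand-in for high-exploration relocation.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Camera:
    """Hypothetical camera reduced to a position; a real one would add rotation and lens."""
    x: float
    y: float
    z: float

    def pos(self):
        return (self.x, self.y, self.z)


@dataclass
class Memory:
    """Assumed form of region memory: spheres the Director should avoid."""
    dead_zones: list = field(default_factory=list)  # (Camera, radius) pairs

    def is_dead(self, cam):
        return any(math.dist(cam.pos(), c.pos()) < r for c, r in self.dead_zones)


def toy_score(cam):
    """Stand-in for render + critique: higher is better, peak at (2, -3, 1.5)."""
    return -math.dist(cam.pos(), (2.0, -3.0, 1.5))


def passes_rules(cam):
    """Toy rule check: keep the camera between floor level and the ceiling."""
    return 0.2 <= cam.z <= 3.0


def director(rng, incumbent, memory, spread, n=8):
    """Propose up to n candidates outside dead zones, near the incumbent if one exists."""
    cands, attempts = [], 0
    while len(cands) < n and attempts < 100 * n:
        attempts += 1
        if incumbent is None:
            cam = Camera(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(0, 4))
        else:
            cam = Camera(*(v + rng.gauss(0, spread) for v in incumbent.pos()))
        if not memory.is_dead(cam):
            cands.append(cam)
    return cands


def reviewer(cands, incumbent, score):
    """Rule-check each candidate, then compare it pairwise against the current best."""
    best, failures = incumbent, []
    for cam in cands:
        if not passes_rules(cam):
            failures.append(cam)
        elif best is None or score(cam) > score(best):
            best = cam
    return best, failures


def reflector(memory, failures, radius=0.5):
    """Turn rule failures into dead zones for later rounds."""
    memory.dead_zones.extend((cam, radius) for cam in failures)


def search(rounds=6, seed=0, score=toy_score):
    """Run the loop for a fixed number of rounds; return the incumbent and score history."""
    rng, memory = random.Random(seed), Memory()
    incumbent, spread, history = None, 1.0, []
    for _ in range(rounds):
        cands = director(rng, incumbent, memory, spread)
        new_best, failures = reviewer(cands, incumbent, score)
        reflector(memory, failures)
        # Assumed relocation heuristic: widen proposals after a stalled round.
        spread = 1.0 if new_best != incumbent else spread * 2.0
        incumbent = new_best
        history.append(None if incumbent is None else score(incumbent))
    return incumbent, history


if __name__ == "__main__":
    best, history = search()
    print(best, history)
```

Because the Reviewer only replaces the incumbent when a candidate scores strictly higher, the recorded score never decreases across rounds. In a real system the toy score would be replaced by a render call and a learned or model-based judge.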