PhotoFlow: Agentic 3D Virtual Photography Missions

May 22, 2026
Authors: Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong
cs.AI

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
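
The closed-loop search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Camera`, `Memory`, `rule_check`, `toy_score`, `director`, `reviewer`, `reflector`, `search`) is hypothetical, the camera is reduced to three orbit parameters, and `toy_score` stands in for the real render plus visual critique, which the abstract does not specify. Only the overall shape (propose candidates, check and compare against the incumbent, remember failures, relocate when stuck, stop after six rounds) follows the abstract.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Camera:
    azimuth: float    # degrees around the subject
    elevation: float  # degrees above the horizon
    distance: float   # scene units from the subject


@dataclass
class Memory:
    dead_zones: list = field(default_factory=list)  # cameras that failed rule checks
    stalled: int = 0                                 # rounds without a new incumbent


def rule_check(cam):
    """Hard constraints (assumed): stay above ground and within a distance range."""
    return 0.0 <= cam.elevation <= 80.0 and 1.0 <= cam.distance <= 12.0


def toy_score(cam):
    """Stand-in for render plus visual critique; peaks at an arbitrary target pose."""
    d_az = abs(cam.azimuth - 40.0) % 360.0
    d_az = min(d_az, 360.0 - d_az)
    return -(d_az / 180.0 + abs(cam.elevation - 15.0) / 90.0
             + abs(cam.distance - 4.0) / 10.0)


def near(a, b):
    """True when two cameras fall in the same small region of parameter space."""
    d_az = abs(a.azimuth - b.azimuth) % 360.0
    d_az = min(d_az, 360.0 - d_az)
    return (d_az < 15.0 and abs(a.elevation - b.elevation) < 10.0
            and abs(a.distance - b.distance) < 1.0)


def director(rng, incumbent, memory, n=4):
    """Propose up to n candidates: local refinements, or global relocation if stalled."""
    explore = incumbent is None or memory.stalled >= 2
    candidates = []
    for _ in range(200):  # bounded attempts so the loop always terminates
        if len(candidates) == n:
            break
        if explore:
            cam = Camera(rng.uniform(0.0, 360.0), rng.uniform(-10.0, 90.0),
                         rng.uniform(0.5, 14.0))
        else:
            cam = Camera((incumbent.azimuth + rng.gauss(0.0, 15.0)) % 360.0,
                         incumbent.elevation + rng.gauss(0.0, 8.0),
                         incumbent.distance + rng.gauss(0.0, 1.0))
        # Dead-zone suppression: skip proposals close to remembered failures.
        if not any(near(cam, z) for z in memory.dead_zones):
            candidates.append(cam)
    return candidates


def reviewer(candidates, incumbent):
    """Apply rule checks, then compare each survivor pairwise with the incumbent."""
    failed = [c for c in candidates if not rule_check(c)]
    best = incumbent
    for cam in candidates:
        if rule_check(cam) and (best is None or toy_score(cam) > toy_score(best)):
            best = cam
    return best, failed


def reflector(memory, failed, improved):
    """Turn failures into dead zones and track stalls that trigger relocation."""
    memory.dead_zones.extend(failed)
    memory.stalled = 0 if improved else memory.stalled + 1


def search(seed=0, rounds=6):
    """Run the loop under a fixed round budget; returns (incumbent, memory)."""
    rng = random.Random(seed)
    memory, incumbent = Memory(), None
    for _ in range(rounds):
        candidates = director(rng, incumbent, memory)
        new_incumbent, failed = reviewer(candidates, incumbent)
        reflector(memory, failed, improved=new_incumbent is not incumbent)
        incumbent = new_incumbent
    return incumbent, memory
```

Calling `search(seed=0)` returns the incumbent camera after six rounds together with the accumulated memory. In this sketch the incumbent is only ever replaced by a candidate that passes `rule_check` and scores strictly higher, so its score cannot decrease from round to round; whether the actual system gives the same guarantee is not stated in the abstract.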