PhotoFlow: Agent-Based 3D Virtual Photography Missions

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise selection against the incumbent; and the Reflector converts failures into region memory, dead-zone suppression, and high-exploration relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. In held-out experiments under a six-round rendering budget, PhotoFlow achieves a higher external quality-alignment composite and success rate than one-shot prediction, single-chain reflection, anchor-bank selection, and random search. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
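The closed-loop search described in the abstract can be illustrated with a minimal sketch. It is not the PhotoFlow implementation: the camera parameterisation, the toy scoring function that stands in for rendering and visual critique, the thresholds, and every name below (`Camera`, `Memory`, `propose`, `review`, `search`) are hypothetical. We also assume that each of the six rounds reviews a small batch of candidates, which the abstract does not specify.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Camera:
    """Hypothetical camera: azimuth on a ring around the subject, plus height."""
    azimuth_deg: float
    height_m: float


@dataclass
class Memory:
    """Reflector state (assumed form): azimuth regions that produced failed shots."""
    dead_zones: list = field(default_factory=list)  # (center_deg, half_width_deg)

    def is_suppressed(self, cam):
        return any(angular_gap(cam.azimuth_deg, c) <= w for c, w in self.dead_zones)


def angular_gap(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def propose(rng, memory, incumbent, n=4, explore=False):
    """Director stand-in: sample near the incumbent, or anywhere when exploring."""
    out, attempts = [], 0
    while len(out) < n and attempts < 200:
        attempts += 1
        if incumbent is None or explore:
            cam = Camera(rng.uniform(0.0, 360.0), rng.uniform(0.5, 3.0))
        else:
            az = (incumbent.azimuth_deg + rng.gauss(0.0, 20.0)) % 360.0
            h = min(3.0, max(0.5, incumbent.height_m + rng.gauss(0.0, 0.3)))
            cam = Camera(az, h)
        if not memory.is_suppressed(cam):  # skip candidates inside dead zones
            out.append(cam)
    return out


def review(cam, target):
    """Reviewer stand-in: a rule check, then a toy score in [0, 1].

    A real system would render the view and critique the image; here the
    score only measures closeness to a known target camera.
    """
    if not 0.5 <= cam.height_m <= 3.0:  # rule check: camera inside allowed band
        return 0.0
    gap = angular_gap(cam.azimuth_deg, target.azimuth_deg) / 180.0
    dh = abs(cam.height_m - target.height_m) / 2.5
    return max(0.0, 1.0 - 0.7 * gap - 0.3 * dh)


def search(target, rounds=6, seed=0, fail_below=0.4):
    """Run the propose-review-reflect loop for a fixed number of rounds."""
    rng = random.Random(seed)
    memory = Memory()
    incumbent, best, history = None, -1.0, []
    explore = True
    for _ in range(rounds):
        improved = False
        for cam in propose(rng, memory, incumbent, explore=explore):
            score = review(cam, target)
            if score > best:  # pairwise comparison against the incumbent
                incumbent, best, improved = cam, score, True
            elif score < fail_below:  # reflect: remember the failed region
                memory.dead_zones.append((cam.azimuth_deg, 15.0))
        explore = not improved  # relocate with high exploration after a stall
        history.append(best)
    return incumbent, best, history


if __name__ == "__main__":
    cam, best, history = search(Camera(azimuth_deg=120.0, height_m=1.6))
    print(cam, round(best, 3), [round(s, 3) for s in history])
```

The incumbent is only ever replaced by a strictly better candidate, so the best score per round cannot decrease. Failed candidates shrink the region that later proposals may sample, and a round without improvement switches the next round from local refinement to scene-wide sampling.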