Code-as-Room: 3D Room Generation from Top-Down Images via Agentic Code Synthesis

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. Recent approaches based on multimodal large language models (MLLMs) have shown great potential for synthesizing 3D rooms from textual descriptions or reference images. However, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms as Blender code. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, then synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate the context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Using this benchmark, we conduct comprehensive comparisons against existing agent-based methods to validate the effectiveness of the proposed execution harness.
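
The abstract does not specify how the execution harness or the memory module is implemented. As a rough illustration only, the sketch below shows one way a staged pipeline with a bounded retry budget and a shared cross-stage memory could be organized. Every name in it (`STAGES`, `Memory`, `StageResult`, `run_pipeline`, `fake_generate`, `fake_execute`, `max_attempts`) is hypothetical and is not taken from Code-as-Room. The MLLM call and the Blender execution are replaced by simple stand-in functions so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: one stage per code type named in the abstract.
STAGES = ["geometry", "materials", "lighting"]


@dataclass
class Memory:
    """Hypothetical cross-stage memory: output of earlier stages, readable later."""
    notes: Dict[str, str] = field(default_factory=dict)


@dataclass
class StageResult:
    stage: str
    code: str
    attempts: int
    ok: bool


def run_pipeline(
    generate: Callable[[str, Memory, str], str],
    execute: Callable[[str], str],
    max_attempts: int = 3,
) -> List[StageResult]:
    """Run the stages in order; each stage gets at most `max_attempts` tries."""
    memory = Memory()
    results: List[StageResult] = []
    for stage in STAGES:
        feedback = ""  # error text from the previous failed attempt, if any
        result = StageResult(stage, "", 0, False)
        for attempt in range(1, max_attempts + 1):
            code = generate(stage, memory, feedback)
            error = execute(code)  # empty string means the code ran cleanly
            result = StageResult(stage, code, attempt, error == "")
            if result.ok:
                memory.notes[stage] = code  # make it visible to later stages
                break
            feedback = error
        results.append(result)
        if not result.ok:
            break  # retry budget exhausted: stop instead of looping forever
    return results


def fake_generate(stage: str, memory: Memory, feedback: str) -> str:
    # Stand-in for an MLLM call; it only records the context it was given.
    seen = ",".join(memory.notes) or "none"
    suffix = " (revised)" if feedback else ""
    return f"# {stage} code; earlier stages: {seen}{suffix}"


def fake_execute(code: str) -> str:
    # Stand-in for running code in Blender: reject unrevised materials code.
    if code.startswith("# materials") and "(revised)" not in code:
        return "NameError: material not defined"
    return ""


if __name__ == "__main__":
    for r in run_pipeline(fake_generate, fake_execute):
        print(r.stage, r.attempts, r.ok)
```

In this sketch the fixed retry budget is what rules out unbounded looping, and later stages see earlier output through `Memory` rather than through a growing conversation history. Whether the actual framework works this way is not stated in the abstract.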