Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Abstract

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
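The check-and-repair loop described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes axis-aligned furniture footprints, a single geometric check (free clearance along the x axis), and a naive repair that shifts the offending object. All names (`Box`, `Violation`, `min_clearance`, `check_and_repair`) and the 1.0 m clearance are hypothetical and chosen for illustration only. In the full framework, LLM-based and VLM-based evaluators would presumably plug in as further check functions alongside the geometric one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned furniture footprint (hypothetical representation)."""
    name: str
    x: float  # min-corner x, in metres
    y: float  # min-corner y, in metres
    w: float  # extent along x
    d: float  # extent along y


@dataclass
class Violation:
    constraint: str
    obj: str    # object the repair step should move
    dx: float   # suggested shift along x that satisfies the constraint


Layout = Dict[str, Box]
Check = Callable[[Layout], Optional[Violation]]


def gap_x(a: Box, b: Box) -> float:
    """Free distance along x between a and b (negative if they overlap in x)."""
    return max(b.x - (a.x + a.w), a.x - (b.x + b.w))


def min_clearance(a_name: str, b_name: str, clearance: float) -> Check:
    """Build a geometric check: keep at least `clearance` metres between a and b."""
    def check(layout: Layout) -> Optional[Violation]:
        a, b = layout[a_name], layout[b_name]
        gap = gap_x(a, b)
        if gap >= clearance:
            return None
        direction = 1.0 if b.x >= a.x else -1.0  # push b away from a
        label = f"clearance({a_name}, {b_name}) >= {clearance}"
        return Violation(label, b_name, direction * (clearance - gap))
    return check


def check_and_repair(layout: Layout, checks: List[Check],
                     max_iters: int = 10) -> Tuple[Layout, bool]:
    """Alternate evaluation and repair until all checks pass or the budget runs out."""
    for _ in range(max_iters):
        violations = [v for c in checks if (v := c(layout)) is not None]
        if not violations:
            return layout, True
        for v in violations:
            layout[v.obj].x += v.dx  # naive repair: apply the suggested shift
    return layout, all(c(layout) is None for c in checks)


# Illustrative case: a desk placed too close to a bed for comfortable access.
layout = {
    "bed": Box("bed", x=0.0, y=0.0, w=2.0, d=1.5),
    "desk": Box("desk", x=2.25, y=0.0, w=1.25, d=0.75),
}
repaired, ok = check_and_repair(layout, [min_clearance("bed", "desk", 1.0)])
```

The sketch ignores room boundaries and interactions between repairs, so a shift that fixes one constraint can break another. That is why the loop re-evaluates every check after each repair pass instead of trusting a single fix.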