SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Abstract

We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by the recent successes of transformers and large language models (LLMs), and it departs from more traditional methods, which commonly describe scenes as meshes, voxel grids, point clouds, or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments, consisting of 100k high-quality indoor scenes with photorealistic, ground-truth-annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation and competitive results in 3D object detection. Lastly, we explore an advantage of SceneScript: it readily adapts to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.
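To make the idea of a scene expressed as a token sequence of structured commands more concrete, here is a minimal Python sketch. The abstract does not specify the command vocabulary, parameter lists, or tokenization, so everything in it is an illustrative assumption rather than the authors' scheme: the command names (`make_wall`, `make_door`, `make_bbox`), the parameters, the 5 cm quantization step, and the token layout. It only shows how a list of commands could be flattened into integer tokens and read back one token at a time, which is the form an autoregressive decoder would emit.

```python
from dataclasses import dataclass

# Hypothetical command vocabulary and parameter lists (assumptions for
# illustration; the abstract does not define them).
COMMANDS = ["make_wall", "make_door", "make_bbox"]
PARAMS = {
    "make_wall": ["x0", "y0", "x1", "y1", "height"],
    "make_door": ["x", "y", "width", "height"],
    "make_bbox": ["x", "y", "z", "w", "d", "h"],
}

STOP = 0                                   # end-of-sequence token
CMD_OFFSET = 1                             # command tokens: 1..len(COMMANDS)
VALUE_OFFSET = CMD_OFFSET + len(COMMANDS)  # value tokens start after commands
RESOLUTION = 0.05                          # metres per bin (assumed)
NUM_BINS = 400                             # covers 0..20 m (assumed)


@dataclass
class Command:
    name: str
    values: dict


def quantize(value: float) -> int:
    """Map a continuous value in metres to an integer bin index."""
    bin_index = round(value / RESOLUTION)
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"value {value} is outside the quantization range")
    return bin_index


def encode(scene: list) -> list:
    """Flatten a list of commands into one integer token sequence."""
    tokens = []
    for cmd in scene:
        tokens.append(CMD_OFFSET + COMMANDS.index(cmd.name))
        for param in PARAMS[cmd.name]:
            tokens.append(VALUE_OFFSET + quantize(cmd.values[param]))
    tokens.append(STOP)
    return tokens


def decode(tokens: list) -> list:
    """Read tokens left to right and rebuild the commands."""
    scene = []
    stream = iter(tokens)
    for token in stream:
        if token == STOP:
            break
        name = COMMANDS[token - CMD_OFFSET]
        # The command type fixes how many value tokens follow it.
        values = {p: (next(stream) - VALUE_OFFSET) * RESOLUTION
                  for p in PARAMS[name]}
        scene.append(Command(name, values))
    return scene


scene = [
    Command("make_wall", {"x0": 0.0, "y0": 0.0, "x1": 4.0, "y1": 0.0, "height": 2.5}),
    Command("make_door", {"x": 1.0, "y": 0.0, "width": 0.9, "height": 2.0}),
]
tokens = encode(scene)
restored = decode(tokens)
```

Under these assumptions, supporting a new kind of scene element only requires one more entry in `COMMANDS` and `PARAMS`; the encoding and decoding logic stays unchanged, which mirrors the extensibility the abstract describes.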
