SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Abstract

We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers and LLMs, and departs from more traditional methods, which commonly describe scenes as meshes, voxel grids, point clouds, or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments, consisting of 100k high-quality indoor scenes with photorealistic, ground-truth-annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation and competitive results in 3D object detection. Lastly, we explore an advantage of SceneScript: it readily adapts to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.
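To make the idea of a scene expressed as a token sequence of structured language commands more concrete, here is a minimal sketch in Python. It is an illustration only and not the SceneScript implementation: the command names (`make_wall`, `make_door`), their parameter lists, the 0.05 m quantization step, and the token layout are all assumptions chosen for this example. It covers only the serialization of commands into integer tokens that an autoregressive decoder could emit one at a time; the visual encoder and the decoder network themselves are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical command schema (illustrative names, not taken from the source).
# Supporting a new command only requires adding an entry here.
PARAMS: Dict[str, List[str]] = {
    "make_wall": ["a_x", "a_y", "b_x", "b_y", "height"],
    "make_door": ["x", "y", "width", "height"],
}
COMMAND_NAMES = list(PARAMS)

STOP = 0                               # token that ends the sequence
VALUE_OFFSET = 1 + len(COMMAND_NAMES)  # value tokens start after command tokens
RESOLUTION = 0.05                      # assumed quantization step in metres
NUM_BINS = 400                         # assumed number of value bins (0 to 20 m)


@dataclass
class Command:
    name: str
    values: Dict[str, float]


def quantize(value: float) -> int:
    """Map a continuous value to an integer bin, clamped to the valid range."""
    bin_index = int(round(value / RESOLUTION))
    return max(0, min(NUM_BINS - 1, bin_index))


def encode(commands: List[Command]) -> List[int]:
    """Serialize commands into a flat list of integer tokens."""
    tokens: List[int] = []
    for cmd in commands:
        tokens.append(1 + COMMAND_NAMES.index(cmd.name))  # command-type token
        for param in PARAMS[cmd.name]:                    # fixed parameter order
            tokens.append(VALUE_OFFSET + quantize(cmd.values[param]))
    tokens.append(STOP)
    return tokens


def decode(tokens: List[int]) -> List[Command]:
    """Parse a token list back into commands, stopping at the STOP token."""
    commands: List[Command] = []
    i = 0
    while tokens[i] != STOP:
        name = COMMAND_NAMES[tokens[i] - 1]
        i += 1
        values: Dict[str, float] = {}
        for param in PARAMS[name]:
            values[param] = (tokens[i] - VALUE_OFFSET) * RESOLUTION
            i += 1
        commands.append(Command(name, values))
    return commands


scene = [
    Command("make_wall", {"a_x": 0.0, "a_y": 0.0, "b_x": 4.0, "b_y": 0.0, "height": 2.5}),
    Command("make_door", {"x": 1.0, "y": 0.0, "width": 0.9, "height": 2.0}),
]
tokens = encode(scene)
restored = decode(tokens)
```

In this sketch, adding a new command type amounts to adding one entry to `PARAMS`, which mirrors the extensibility the abstract describes. A real system would also need to handle values outside the quantization range and malformed token sequences.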
