SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
March 19, 2024
Authors: Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas
cs.AI
Abstract
We introduce SceneScript, a method that directly produces full scene models
as a sequence of structured language commands using an autoregressive,
token-based approach. Our proposed scene representation is inspired by recent
successes in transformers & LLMs, and departs from more traditional methods
which commonly describe scenes as meshes, voxel grids, point clouds or radiance
fields. Our method infers the set of structured language commands directly from
encoded visual data using a scene language encoder-decoder architecture. To
train SceneScript, we generate and release a large-scale synthetic dataset
called Aria Synthetic Environments consisting of 100k high-quality indoor
scenes, with photorealistic and ground-truth annotated renders of egocentric
scene walkthroughs. Our method achieves state-of-the-art results in
architectural layout estimation, and competitive results in 3D object
detection. Lastly, we explore an advantage of SceneScript: the ability to
readily adapt to new commands via simple additions to the structured language,
which we illustrate for tasks such as coarse 3D object part reconstruction.
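
To make the idea of a token-based scene language concrete, below is a minimal, hypothetical sketch of how structured commands along the lines of make_wall, make_door, and make_bbox could be serialized into a flat integer sequence for an autoregressive decoder to predict. The parameter sets, quantization scheme, and shared vocabulary layout here are illustrative assumptions, not the released SceneScript implementation.

```python
# Hypothetical sketch: serializing a structured scene language into tokens.
# Command names follow the paper's spirit (walls, doors, object boxes), but
# the exact parameters, bin count, and scene extent are assumptions.

COMMANDS = ["<stop>", "make_wall", "make_door", "make_bbox"]
PARAM_BINS = 256          # assumed number of discretization bins
MAX_COORD = 10.0          # assumed metric extent of a scene, in meters


def quantize(value: float) -> int:
    """Map a continuous parameter to an integer bin in [0, PARAM_BINS)."""
    clipped = min(max(value, 0.0), MAX_COORD)
    return int(round(clipped / MAX_COORD * (PARAM_BINS - 1)))


def encode_scene(commands: list[tuple[str, list[float]]]) -> list[int]:
    """Flatten (command, params) pairs into one integer token sequence.

    Command tokens and quantized parameter tokens share a single vocabulary:
    command ids occupy [0, len(COMMANDS)) and parameter bins follow after.
    An autoregressive decoder would predict this sequence token by token,
    conditioned on the encoded visual data.
    """
    tokens = []
    for name, params in commands:
        tokens.append(COMMANDS.index(name))
        tokens.extend(len(COMMANDS) + quantize(p) for p in params)
    tokens.append(COMMANDS.index("<stop>"))
    return tokens


if __name__ == "__main__":
    scene = [
        ("make_wall", [0.0, 0.0, 4.2, 0.0, 2.6]),       # two endpoints + height
        ("make_door", [1.1, 0.0, 0.9, 2.0]),            # position + width/height
        ("make_bbox", [1.4, 0.8, 0.0, 0.5, 0.5, 0.9]),  # center + size
    ]
    print(encode_scene(scene))
```

Under a scheme like this, adapting to a new task would amount to appending new command names (for example, a coarse object-part primitive) to the vocabulary and fine-tuning the decoder, which matches the extensibility the abstract describes.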