MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Abstract

For robots to interpret human instructions and execute manipulation tasks, they need task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on either time-consuming manual layout design or purely randomized layouts, which limits their plausibility or their alignment with the tasks. In this paper, we formulate a novel task, task-oriented tabletop scene generation, which is challenging because of the substantial gap between high-level task instructions and tabletop scenes. To support research on this task, we introduce MesaTask-10K, a large-scale dataset of approximately 10,700 synthetic tabletop scenes whose manually crafted layouts ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that uses this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask over baselines in generating task-conforming tabletop scenes with realistic layouts. The project page is at https://mesatask.github.io/.
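
The abstract states only that MesaTask is enhanced with DPO algorithms; it does not describe the training setup. As a rough illustration, the sketch below computes the standard per-pair Direct Preference Optimization loss from sequence log-probabilities. The function name `dpo_loss`, the default `beta`, and the reading of "chosen" and "rejected" as a preferred and a less preferred generated layout are assumptions made for this example, not details taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    All arguments are summed token log-probabilities of one response.
    "chosen" is the preferred response (assumed here: a better layout),
    "rejected" the less preferred one. `beta` is a hypothetical default.
    """
    # How much more the policy favors each response than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The policy has shifted probability mass toward the chosen response,
    # so the loss falls below log(2), its value when policy == reference.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. It decreases as the policy favors the chosen response more strongly than the reference does.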
