MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Abstract

For robots to interpret human instructions and execute manipulation tasks, task-relevant tabletop scenes must be available for training. However, traditional methods for creating these scenes rely on either time-consuming manual layout design or purely randomized layouts, which limits their plausibility or their alignment with the tasks. In this paper, we formulate a novel task, task-oriented tabletop scene generation, which is challenging because of the substantial gap between high-level task instructions and tabletop scenes. To support research on this task, we introduce MesaTask-10K, a large-scale dataset of approximately 10,700 synthetic tabletop scenes whose layouts are manually crafted to ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that uses this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate that MesaTask outperforms baselines in generating task-conforming tabletop scenes with realistic layouts. The project page is at https://mesatask.github.io/
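To make the three stages of the Spatial Reasoning Chain concrete, the following is a minimal illustrative sketch, not the MesaTask implementation. It assumes a toy keyword table in place of the LLM's object inference and a single hypothetical "right_of" rule in place of learned spatial reasoning. Every name in it (`Relation`, `SceneGraph`, `TASK_OBJECTS`, `infer_objects`, `reason_relations`, `build_scene_graph`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed spatial relation, e.g. ("cup", "right_of", "kettle")."""
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    """Objects are nodes; spatial relations are directed, labeled edges."""
    objects: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


# Hypothetical lookup table standing in for LLM-based object inference.
TASK_OBJECTS = {
    "pour": ["kettle", "cup"],
    "write": ["notebook", "pen"],
}


def infer_objects(task: str) -> list[str]:
    """Stage 1 (toy): collect objects whose trigger word appears in the task."""
    found: list[str] = []
    for keyword, objects in TASK_OBJECTS.items():
        if keyword in task.lower():
            found.extend(o for o in objects if o not in found)
    return found


def reason_relations(objects: list[str]) -> list[Relation]:
    """Stage 2 (toy): place each object to the right of the previous one."""
    return [Relation(b, "right_of", a) for a, b in zip(objects, objects[1:])]


def build_scene_graph(task: str) -> SceneGraph:
    """Stage 3: assemble inferred objects and relations into a scene graph."""
    objects = infer_objects(task)
    return SceneGraph(objects, reason_relations(objects))


if __name__ == "__main__":
    graph = build_scene_graph("Pour tea from the kettle into the cup")
    print(graph.objects)
    print(graph.relations)
```

In this sketch the scene graph is only a symbolic intermediate; turning it into a 3D layout with object positions and sizes would be a further step that is not modeled here.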
