LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

November 4, 2025
Authors: Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo
cs.AI

Abstract

Despite recent progress in using Large Language Models (LLMs) to automatically generate 3D scenes, the generated scenes often lack the realistic spatial layouts and object attributes found in real-world environments. Because this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by fine-grained instructions that reflect real-world environments is crucial. Without such realistic scenes, embodied agents trained in unrealistic environments can learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Verifying the alignment between a fine-grained instruction and the generated scene is therefore essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to assess this alignment reliably, primarily because their shallow understanding of 3D scenes leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify the complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms the VLM-as-a-judge approach by 0.41 in F1 score when assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, the success rate in generating scenes that fully align with fine-grained instructions is at most 10%.
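
The CLIPScore baseline the abstract critiques reduces scene-instruction alignment to a single image-text similarity. Below is a minimal sketch of that baseline, assuming one rendered view of the generated scene is available as an image file; the checkpoint name and the 2.5 scaling follow the original CLIPScore formulation (Hessel et al., 2021), not anything specified by this paper.

```python
# Minimal sketch of a CLIPScore-style baseline for scene-instruction
# alignment: score one rendered view of the 3D scene against the text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, instruction: str) -> float:
    """CLIPScore between a rendered scene view and the instruction text."""
    image = Image.open(image_path)
    inputs = processor(text=[instruction], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings, then take cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)  # scaling from Hessel et al. (2021)

# A single global embedding cannot tell WHICH object violates WHICH
# constraint -- the shallow grounding the abstract points to.
score = clip_score("rendered_scene.png",
                   "A bedroom with a mug on the left nightstand.")
```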
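By contrast, the tool-augmented grounding that LEGO-Eval is built around can be pictured as decomposing a fine-grained instruction into atomic constraints and verifying each one geometrically against the scene. The sketch below is a hypothetical illustration of that idea, not the authors' implementation; SceneObject, find_object, and is_on_top_of are names invented for this example.

```python
# Hypothetical sketch of tool-augmented alignment checking: ground each
# referenced object in the scene, then verify a spatial relation with a
# geometric tool. All names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str                              # e.g. "mug", "nightstand_left"
    position: tuple[float, float, float]   # world-frame (x, y, z) center
    size: tuple[float, float, float]       # axis-aligned bounding-box extents

def find_object(scene: list[SceneObject], name: str) -> SceneObject | None:
    """Grounding tool: resolve an instruction phrase to a scene object."""
    return next((o for o in scene if o.name == name), None)

def is_on_top_of(a: SceneObject, b: SceneObject, tol: float = 0.05) -> bool:
    """Geometric tool: check that object a rests on top of object b."""
    ax, ay, az = a.position
    bx, by, bz = b.position
    # Horizontal overlap: a's center lies within b's top-face footprint.
    over_x = abs(ax - bx) <= b.size[0] / 2
    over_y = abs(ay - by) <= b.size[1] / 2
    # Vertical contact: a's bottom sits within tol of b's top surface.
    contact = abs((az - a.size[2] / 2) - (bz + b.size[2] / 2)) <= tol
    return over_x and over_y and contact

# One atomic constraint from "a mug on the left nightstand":
scene = [
    SceneObject("nightstand_left", (0.0, 0.0, 0.25), (0.4, 0.4, 0.5)),
    SceneObject("mug", (0.05, 0.1, 0.55), (0.1, 0.1, 0.1)),
]
mug = find_object(scene, "mug")
stand = find_object(scene, "nightstand_left")
aligned = mug is not None and stand is not None and is_on_top_of(mug, stand)
print(aligned)  # True: this constraint is satisfied by explicit geometry
```

Checking constraints one at a time in this way makes failures localizable (which object, which relation), which is what a single global image-text score cannot provide.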