LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

November 4, 2025
Authors: Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo
cs.AI

Abstract

Despite recent progress in using Large Language Models (LLMs) to automatically generate 3D scenes, the generated scenes often lack the realistic spatial layouts and object attributes found in real-world environments. Because this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by fine-grained instructions that reflect real-world environments is crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Verifying the alignment between a fine-grained instruction and the generated scene is therefore essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to assess this alignment reliably, primarily because their shallow understanding of 3D scenes leaves scene components improperly grounded. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools that explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions specifying the complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 in F1 score when assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, the success rate of generating scenes that fully align with fine-grained instructions is at most 10%.
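To make the abstract's critique of CLIPScore concrete, below is a minimal sketch of the kind of image-level check it refers to: a single rendered view of the scene is scored against the instruction by cosine similarity of CLIP embeddings, scaled by 2.5 as in the common CLIPScore formulation. This is not the paper's code; the model checkpoint and the rendered-image input are assumptions for illustration.

```python
# Minimal CLIPScore-style alignment check between a rendered 3D scene
# image and a fine-grained instruction. This is the shallow, image-level
# metric the abstract argues cannot ground individual scene components.
# Model choice and the 2.5 scaling follow the common CLIPScore
# formulation, not necessarily the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, instruction: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[instruction], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the image and text embeddings,
    # rescaled as in the standard CLIPScore definition.
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return max(0.0, 2.5 * float((img * txt).sum()))

# Hypothetical usage with an assumed rendered view of the scene:
# score = clip_score("scene_render.png",
#                    "a kitchen with a round wooden table left of the stove")
```

A single global similarity of this kind cannot verify per-object attributes or spatial relations (e.g., whether the table is actually round, or left of the stove), which is the gap LEGO-Eval addresses by grounding each scene component with dedicated tools.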