4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
March 22, 2025
Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there is no publicly available standardized benchmark for assessing the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark for evaluating the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects from diverse categories, high-quality annotations, and tasks that require multi-view spatio-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The 4D object captioning experiments indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, the gap is larger for temporal understanding. The 4D object QA experiments yield surprising findings: even with simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.