4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
March 22, 2025
Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there is no publicly available standardized benchmark for assessing the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark for evaluating the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects from diverse categories, high-quality annotations, and tasks that require multi-view spatio-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The 4D object captioning experiments indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, the gap is larger for temporal understanding. The 4D object QA experiments yield surprising findings: even with simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.