Flat-Pack Bench: Evaluating the Spatio-Temporal Understanding of Large Vision-Language Models Through Furniture Assembly

Abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that are easy to identify verbally, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly states, understanding of part mating, and tracking. It uses multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, exposing three limitations: ineffective use of temporal information in videos, limited tracking ability, and a weak understanding of spatial interactions such as physical contact.