Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that are easy to identify verbally, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding of part mating, and tracking. It uses multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.
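
As an illustration of how a multiple-choice evaluation of this kind could be scored, the following is a minimal sketch that computes accuracy separately for each task family. It is not the benchmark's actual evaluation code: the record keys (`task`, `answer`, `prediction`), the task label strings, and the function name `per_task_accuracy` are all assumptions made for this example.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for multiple-choice predictions.

    `records` is an iterable of dicts with the hypothetical keys
    'task' (task family label), 'answer' (ground-truth option letter),
    and 'prediction' (the option letter chosen by the model).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters ignoring case and surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical usage with made-up records (not real benchmark data).
example_records = [
    {"task": "temporal_ordering", "answer": "A", "prediction": "a"},
    {"task": "temporal_ordering", "answer": "B", "prediction": "C"},
    {"task": "tracking", "answer": "D", "prediction": " D "},
]
example_scores = per_task_accuracy(example_records)
```

Reporting accuracy per task family rather than as a single pooled number keeps weaknesses in one skill, such as tracking, from being masked by stronger results on another.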