Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that are easy to identify verbally, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly states, understanding of part mating, and tracking. It does so using multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments show that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, revealing limitations in effectively leveraging temporal information from videos, weak tracking ability, and a poor understanding of spatial interactions such as physical contact.