Flat-Pack Bench: Evaluating the Spatio-Temporal Understanding of Large Vision-Language Models through Furniture Assembly

Abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that are easy to identify verbally, such as household objects, animals, and people, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly states, understanding of part mating, and tracking. It uses multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, exposing limitations in leveraging temporal information from videos, in tracking, and in understanding spatial interactions such as physical contact.