Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Abstract

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions instead of isolating a model's temporal reasoning ability, the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate a stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when the video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into three domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of the frames. The remaining questions are labeled as Others. This categorization enables fine-grained evaluation of the different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that assess video LLMs more accurately.
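The four categories described above can be illustrated with a minimal sketch. It is not the VBenchComp implementation: the record type `QuestionResult`, its three boolean flags, and the order in which the checks are applied are assumptions made here for illustration, based only on the category definitions in the abstract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Hypothetical per-question record; every field is an assumed input."""
    question_id: str
    correct_without_video: bool  # answered correctly from the question text alone
    correct_ordered: bool        # answered correctly with frames in original order
    correct_shuffled: bool       # answered correctly with frames temporally shuffled


def categorize(result: QuestionResult) -> str:
    """Assign one of the four categories named in the abstract.

    The precedence of the checks is an assumption of this sketch.
    """
    if result.correct_without_video:
        return "LLM-Answerable"
    if result.correct_ordered and result.correct_shuffled:
        return "Semantic"
    if result.correct_ordered and not result.correct_shuffled:
        return "Temporal"
    return "Others"


def category_counts(results: list[QuestionResult]) -> Counter:
    """Count how many questions fall into each category."""
    return Counter(categorize(r) for r in results)


# Toy data, invented purely to exercise the function.
example_results = [
    QuestionResult("q1", True, True, True),
    QuestionResult("q2", False, True, True),
    QuestionResult("q3", False, True, False),
    QuestionResult("q4", False, False, False),
]
example_counts = category_counts(example_results)
```

Reporting accuracy separately within each category, rather than as one overall score, is the kind of fine-grained evaluation the abstract refers to.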
