PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
March 27, 2026
Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna
cs.AI
Abstract
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment suffices: answering each question requires integrating multiple temporally separated pieces of visual evidence and satisfying compositional constraints under conjunctive and sequential logic. The questions span perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and demand skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark comprises 1,114 highly complex, fully manually annotated questions on 279 videos drawn from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports. Human studies show that PerceptionComp demands substantial test-time thinking and repeated perception steps: participants take far longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric, long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
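The question design described above, in which several temporally separated pieces of evidence must be combined under both conjunctive and sequential constraints, can be illustrated with a hypothetical item schema. All field names, timestamps, and values below are assumptions for illustration only; they do not reflect the benchmark's actual data format.

```python
# Hypothetical sketch of a PerceptionComp-style item: the answer hinges on
# combining evidence from several separated moments under both
# conjunctive ("A AND B hold") and sequential ("A occurs before B") logic.
# Every field name and value here is an illustrative assumption.
item = {
    "video": "city_walk_017.mp4",
    "question": "Which object does the person in the red jacket pick up "
                "after passing the fountain?",
    "evidence": [
        {"t": 42.0, "cue": "person in red jacket identified"},   # semantic recognition
        {"t": 95.5, "cue": "same person passes the fountain"},   # visual correspondence
        {"t": 131.2, "cue": "person picks up an umbrella"},      # action perception
    ],
    "constraints": {
        "conjunctive": ["red jacket", "near fountain"],          # all must hold jointly
        "sequential": [(42.0, 95.5), (95.5, 131.2)],             # required temporal order
    },
    "choices": ["umbrella", "camera", "bottle", "map", "hat"],   # five-choice setting
    "answer": "umbrella",
}

def ordering_holds(pairs):
    """Check that every evidence pair respects its required temporal order."""
    return all(a < b for a, b in pairs)

# No single timestamp answers the question; all three cues plus the
# ordering constraint are needed to justify the correct choice.
assert ordering_holds(item["constraints"]["sequential"])
```

The point of the sketch is that dropping any one evidence entry, or violating the ordering constraint, makes the answer underdetermined, which is what distinguishes this design from single-frame question answering.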