BabyVision: Visual Reasoning Beyond Language
January 10, 2026
Authors: Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
cs.AI
Abstract
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncover a striking fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even three-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess the core visual abilities of MLLMs independently of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines: Gemini3-Pro-Preview scores 49.7, lagging behind six-year-old children and falling well short of the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning. We also explore solving visual reasoning with generative models by proposing BabyVision-Gen along with an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.
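As a rough illustration of how a benchmark with this structure might be scored, the sketch below computes overall and per-category accuracy on a 0-100 scale. The JSON layout, the field names (`id`, `category`, `answer`), and the exact-match criterion are all assumptions for illustration; they are not the official BabyVision evaluation toolkit, whose actual interface lives in the repository linked above.

```python
import json
from collections import defaultdict

def score_predictions(items_path: str, predictions: dict[str, str]) -> dict[str, float]:
    """Compute overall and per-category accuracy (0-100 scale).

    `predictions` maps an item id to the model's answer string.
    Exact-match scoring is an assumption; the official toolkit
    may use a more forgiving answer matcher.
    """
    with open(items_path) as f:
        items = json.load(f)  # assumed: a list of item dicts

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        cat = item["category"]  # one of the four key categories (assumed field)
        total[cat] += 1
        pred = predictions.get(item["id"], "").strip().lower()
        if pred == item["answer"].strip().lower():
            correct[cat] += 1

    scores = {c: 100.0 * correct[c] / total[c] for c in total}
    scores["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores
```

Under this assumed layout, an adult-level run would return an overall score near the reported 94.1, while the reported Gemini3-Pro-Preview run would land near 49.7.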