BabyVision: Visual Reasoning Beyond Language
January 10, 2026
Authors: Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
cs.AI
Abstract
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncover a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess MLLMs' core visual abilities independently of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old children and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning. We also explore solving visual reasoning with generative models by proposing BabyVision-Gen together with an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproducibility.
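
As a rough illustration of how scores such as the 49.7 overall figure and the per-category breakdown could be computed from model outputs, the following minimal Python sketch tallies exact-match accuracy per category and overall. It is not the released evaluation toolkit: the file name, JSONL layout, and the "category", "answer", and "prediction" fields are assumptions made for illustration only; the actual format is defined in the repository above.

    # Minimal sketch of scoring model predictions on a BabyVision-style benchmark.
    # Assumed (hypothetical) input: one JSON object per line with
    # "category", "answer", and "prediction" fields.
    import json
    from collections import defaultdict

    def score(pred_path: str) -> None:
        per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
        with open(pred_path, encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                cat = item["category"]
                per_category[cat][1] += 1
                if item["prediction"].strip().lower() == item["answer"].strip().lower():
                    per_category[cat][0] += 1
        correct_all = sum(c for c, _ in per_category.values())
        total_all = sum(t for _, t in per_category.values())
        print(f"overall: {100 * correct_all / total_all:.1f}")
        for cat, (correct, total) in sorted(per_category.items()):
            print(f"{cat}: {100 * correct / total:.1f} ({correct}/{total})")

    if __name__ == "__main__":
        score("babyvision_predictions.jsonl")  # hypothetical predictions file

Exact string matching is only a placeholder scoring rule; tasks with free-form or generative answers (e.g., those targeted by BabyVision-Gen) would require the automatic evaluation toolkit described in the paper.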